{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Предобработка текста.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Afanasyy/colab/blob/main/%D0%9F%D1%80%D0%B5%D0%B4%D0%BE%D0%B1%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D0%BA%D0%B0_%D1%82%D0%B5%D0%BA%D1%81%D1%82%D0%B0.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Средства для предобработки текста и выделения полезной информации\n",
        "\n",
        "* Токенизация \n",
        "* лемматизация\n",
        "* стемминг\n",
        "* частотные \n",
        "* удаление стоп-слов"
      ],
      "metadata": {
        "id": "DPOAaO-ZB9w1"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Установка сторонних библиотек\n",
        "\n",
        "Сообщество разработчиков создает и совместно использует широкий спектр сторонних библиотек, их можно установить из Интернета и использовать в вашей программе.\n",
        "\n",
        "Как стандартные библиотеки, они предоставляют стандартизированные решения для многих проблем, возникающих в повседневном программировании + решения для более конкретных ситуаций.\n",
        "\n",
        "От обработки баз данных до машинного обучения и веб-разработки."
      ],
      "metadata": {
        "id": "iLDKUHH5CLW5"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Сторонние библиотеки не присутствуют по умолчанию в Python - их следует сначала установить.\n",
        "\n",
        "Это можно сделать с помощью инструмента `pip`.\n",
        "\n",
        "`pip` - это менеджер пакетов для Python.\n",
        "\n",
        "Он установлен с Python автоматически."
      ],
      "metadata": {
        "id": "hdH1DeCpDnBN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Проверить, установлен ли pip\n",
        "!pip -V"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CjlzuiWrEBAp",
        "outputId": "ea0d9b7d-a765-4513-c680-f1443956bd11"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "pip 21.1.3 from /usr/local/lib/python3.7/dist-packages/pip (python 3.7)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Вывести список уже установленных библиотек\n",
        "!pip list"
      ],
      "metadata": {
        "id": "ut6AJP4zEETO",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "1ae31d1f-0ab9-4f95-80ba-c12d66de16c7"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Package                       Version\n",
            "----------------------------- ---------------------\n",
            "absl-py                       1.0.0\n",
            "alabaster                     0.7.12\n",
            "albumentations                0.1.12\n",
            "altair                        4.2.0\n",
            "appdirs                       1.4.4\n",
            "argon2-cffi                   21.3.0\n",
            "argon2-cffi-bindings          21.2.0\n",
            "arviz                         0.11.4\n",
            "astor                         0.8.1\n",
            "astropy                       4.3.1\n",
            "astunparse                    1.6.3\n",
            "atari-py                      0.2.9\n",
            "atomicwrites                  1.4.0\n",
            "attrs                         21.4.0\n",
            "audioread                     2.1.9\n",
            "autograd                      1.3\n",
            "Babel                         2.9.1\n",
            "backcall                      0.2.0\n",
            "beautifulsoup4                4.6.3\n",
            "bleach                        4.1.0\n",
            "blis                          0.4.1\n",
            "bokeh                         2.3.3\n",
            "Bottleneck                    1.3.4\n",
            "branca                        0.4.2\n",
            "bs4                           0.0.1\n",
            "CacheControl                  0.12.10\n",
            "cached-property               1.5.2\n",
            "cachetools                    4.2.4\n",
            "catalogue                     1.0.0\n",
            "certifi                       2021.10.8\n",
            "cffi                          1.15.0\n",
            "cftime                        1.6.0\n",
            "chardet                       3.0.4\n",
            "charset-normalizer            2.0.12\n",
            "click                         7.1.2\n",
            "cloudpickle                   1.3.0\n",
            "cmake                         3.12.0\n",
            "cmdstanpy                     0.9.5\n",
            "colorcet                      3.0.0\n",
            "colorlover                    0.3.0\n",
            "community                     1.0.0b1\n",
            "contextlib2                   0.5.5\n",
            "convertdate                   2.4.0\n",
            "coverage                      3.7.1\n",
            "coveralls                     0.5\n",
            "crcmod                        1.7\n",
            "cufflinks                     0.17.3\n",
            "cvxopt                        1.2.7\n",
            "cvxpy                         1.0.31\n",
            "cycler                        0.11.0\n",
            "cymem                         2.0.6\n",
            "Cython                        0.29.28\n",
            "daft                          0.0.4\n",
            "dask                          2.12.0\n",
            "datascience                   0.10.6\n",
            "debugpy                       1.0.0\n",
            "decorator                     4.4.2\n",
            "defusedxml                    0.7.1\n",
            "descartes                     1.1.0\n",
            "dill                          0.3.4\n",
            "distributed                   1.25.3\n",
            "dlib                          19.18.0\n",
            "dm-tree                       0.1.6\n",
            "docopt                        0.6.2\n",
            "docutils                      0.17.1\n",
            "dopamine-rl                   1.0.5\n",
            "earthengine-api               0.1.303\n",
            "easydict                      1.9\n",
            "ecos                          2.0.10\n",
            "editdistance                  0.5.3\n",
            "en-core-web-sm                2.2.5\n",
            "entrypoints                   0.4\n",
            "ephem                         4.1.3\n",
            "et-xmlfile                    1.1.0\n",
            "fa2                           0.3.5\n",
            "fastai                        1.0.61\n",
            "fastdtw                       0.3.4\n",
            "fastprogress                  1.0.2\n",
            "fastrlock                     0.8\n",
            "fbprophet                     0.7.1\n",
            "feather-format                0.4.1\n",
            "filelock                      3.6.0\n",
            "firebase-admin                4.4.0\n",
            "fix-yahoo-finance             0.0.22\n",
            "Flask                         1.1.4\n",
            "flatbuffers                   2.0\n",
            "folium                        0.8.3\n",
            "future                        0.16.0\n",
            "gast                          0.5.3\n",
            "GDAL                          2.2.2\n",
            "gdown                         4.2.2\n",
            "gensim                        3.6.0\n",
            "geographiclib                 1.52\n",
            "geopy                         1.17.0\n",
            "gin-config                    0.5.0\n",
            "glob2                         0.7\n",
            "google                        2.0.3\n",
            "google-api-core               1.26.3\n",
            "google-api-python-client      1.12.11\n",
            "google-auth                   1.35.0\n",
            "google-auth-httplib2          0.0.4\n",
            "google-auth-oauthlib          0.4.6\n",
            "google-cloud-bigquery         1.21.0\n",
            "google-cloud-bigquery-storage 1.1.0\n",
            "google-cloud-core             1.0.3\n",
            "google-cloud-datastore        1.8.0\n",
            "google-cloud-firestore        1.7.0\n",
            "google-cloud-language         1.2.0\n",
            "google-cloud-storage          1.18.1\n",
            "google-cloud-translate        1.5.0\n",
            "google-colab                  1.0.0\n",
            "google-pasta                  0.2.0\n",
            "google-resumable-media        0.4.1\n",
            "googleapis-common-protos      1.56.0\n",
            "googledrivedownloader         0.4\n",
            "graphviz                      0.10.1\n",
            "greenlet                      1.1.2\n",
            "grpcio                        1.44.0\n",
            "gspread                       3.4.2\n",
            "gspread-dataframe             3.0.8\n",
            "gym                           0.17.3\n",
            "h5py                          3.1.0\n",
            "HeapDict                      1.0.1\n",
            "hijri-converter               2.2.3\n",
            "holidays                      0.10.5.2\n",
            "holoviews                     1.14.8\n",
            "html5lib                      1.0.1\n",
            "httpimport                    0.5.18\n",
            "httplib2                      0.17.4\n",
            "httplib2shim                  0.0.3\n",
            "humanize                      0.5.1\n",
            "hyperopt                      0.1.2\n",
            "ideep4py                      2.0.0.post3\n",
            "idna                          2.10\n",
            "imageio                       2.4.1\n",
            "imagesize                     1.3.0\n",
            "imbalanced-learn              0.8.1\n",
            "imblearn                      0.0\n",
            "imgaug                        0.2.9\n",
            "importlib-metadata            4.11.3\n",
            "importlib-resources           5.4.0\n",
            "imutils                       0.5.4\n",
            "inflect                       2.1.0\n",
            "iniconfig                     1.1.1\n",
            "intel-openmp                  2022.0.2\n",
            "intervaltree                  2.1.0\n",
            "ipykernel                     4.10.1\n",
            "ipython                       5.5.0\n",
            "ipython-genutils              0.2.0\n",
            "ipython-sql                   0.3.9\n",
            "ipywidgets                    7.7.0\n",
            "itsdangerous                  1.1.0\n",
            "jax                           0.3.4\n",
            "jaxlib                        0.3.2+cuda11.cudnn805\n",
            "jedi                          0.18.1\n",
            "jieba                         0.42.1\n",
            "Jinja2                        2.11.3\n",
            "joblib                        1.1.0\n",
            "jpeg4py                       0.1.4\n",
            "jsonschema                    4.3.3\n",
            "jupyter                       1.0.0\n",
            "jupyter-client                5.3.5\n",
            "jupyter-console               5.2.0\n",
            "jupyter-core                  4.9.2\n",
            "jupyterlab-pygments           0.1.2\n",
            "jupyterlab-widgets            1.1.0\n",
            "kaggle                        1.5.12\n",
            "kapre                         0.3.7\n",
            "keras                         2.8.0\n",
            "Keras-Preprocessing           1.1.2\n",
            "keras-vis                     0.4.1\n",
            "kiwisolver                    1.4.0\n",
            "korean-lunar-calendar         0.2.1\n",
            "libclang                      13.0.0\n",
            "librosa                       0.8.1\n",
            "lightgbm                      2.2.3\n",
            "llvmlite                      0.34.0\n",
            "lmdb                          0.99\n",
            "LunarCalendar                 0.0.9\n",
            "lxml                          4.2.6\n",
            "Markdown                      3.3.6\n",
            "MarkupSafe                    2.0.1\n",
            "matplotlib                    3.2.2\n",
            "matplotlib-inline             0.1.3\n",
            "matplotlib-venn               0.11.6\n",
            "missingno                     0.5.1\n",
            "mistune                       0.8.4\n",
            "mizani                        0.6.0\n",
            "mkl                           2019.0\n",
            "mlxtend                       0.14.0\n",
            "more-itertools                8.12.0\n",
            "moviepy                       0.2.3.5\n",
            "mpmath                        1.2.1\n",
            "msgpack                       1.0.3\n",
            "multiprocess                  0.70.12.2\n",
            "multitasking                  0.0.10\n",
            "murmurhash                    1.0.6\n",
            "music21                       5.5.0\n",
            "natsort                       5.5.0\n",
            "nbclient                      0.5.13\n",
            "nbconvert                     5.6.1\n",
            "nbformat                      5.2.0\n",
            "nest-asyncio                  1.5.4\n",
            "netCDF4                       1.5.8\n",
            "networkx                      2.6.3\n",
            "nibabel                       3.0.2\n",
            "nltk                          3.2.5\n",
            "notebook                      5.3.1\n",
            "numba                         0.51.2\n",
            "numexpr                       2.8.1\n",
            "numpy                         1.21.5\n",
            "nvidia-ml-py3                 7.352.0\n",
            "oauth2client                  4.1.3\n",
            "oauthlib                      3.2.0\n",
            "okgrade                       0.4.3\n",
            "opencv-contrib-python         4.1.2.30\n",
            "opencv-python                 4.1.2.30\n",
            "openpyxl                      3.0.9\n",
            "opt-einsum                    3.3.0\n",
            "osqp                          0.6.2.post0\n",
            "packaging                     21.3\n",
            "palettable                    3.3.0\n",
            "pandas                        1.3.5\n",
            "pandas-datareader             0.9.0\n",
            "pandas-gbq                    0.13.3\n",
            "pandas-profiling              1.4.1\n",
            "pandocfilters                 1.5.0\n",
            "panel                         0.12.1\n",
            "param                         1.12.0\n",
            "parso                         0.8.3\n",
            "pathlib                       1.0.1\n",
            "patsy                         0.5.2\n",
            "pep517                        0.12.0\n",
            "pexpect                       4.8.0\n",
            "pickleshare                   0.7.5\n",
            "Pillow                        7.1.2\n",
            "pip                           21.1.3\n",
            "pip-tools                     6.2.0\n",
            "plac                          1.1.3\n",
            "plotly                        5.5.0\n",
            "plotnine                      0.6.0\n",
            "pluggy                        0.7.1\n",
            "pooch                         1.6.0\n",
            "portpicker                    1.3.9\n",
            "prefetch-generator            1.0.1\n",
            "preshed                       3.0.6\n",
            "prettytable                   3.2.0\n",
            "progressbar2                  3.38.0\n",
            "prometheus-client             0.13.1\n",
            "promise                       2.3\n",
            "prompt-toolkit                1.0.18\n",
            "protobuf                      3.17.3\n",
            "psutil                        5.4.8\n",
            "psycopg2                      2.7.6.1\n",
            "ptyprocess                    0.7.0\n",
            "py                            1.11.0\n",
            "pyarrow                       6.0.1\n",
            "pyasn1                        0.4.8\n",
            "pyasn1-modules                0.2.8\n",
            "pycocotools                   2.0.4\n",
            "pycparser                     2.21\n",
            "pyct                          0.4.8\n",
            "pydata-google-auth            1.4.0\n",
            "pydot                         1.3.0\n",
            "pydot-ng                      2.0.0\n",
            "pydotplus                     2.0.2\n",
            "PyDrive                       1.3.1\n",
            "pyemd                         0.5.1\n",
            "pyerfa                        2.0.0.1\n",
            "pyglet                        1.5.0\n",
            "Pygments                      2.6.1\n",
            "pygobject                     3.26.1\n",
            "pymc3                         3.11.4\n",
            "PyMeeus                       0.5.11\n",
            "pymongo                       4.0.2\n",
            "pymystem3                     0.2.0\n",
            "PyOpenGL                      3.1.6\n",
            "pyparsing                     3.0.7\n",
            "pyrsistent                    0.18.1\n",
            "pysndfile                     1.3.8\n",
            "PySocks                       1.7.1\n",
            "pystan                        2.19.1.1\n",
            "pytest                        3.6.4\n",
            "python-apt                    0.0.0\n",
            "python-chess                  0.23.11\n",
            "python-dateutil               2.8.2\n",
            "python-louvain                0.16\n",
            "python-slugify                6.1.1\n",
            "python-utils                  3.1.0\n",
            "pytz                          2018.9\n",
            "pyviz-comms                   2.1.0\n",
            "PyWavelets                    1.3.0\n",
            "PyYAML                        3.13\n",
            "pyzmq                         22.3.0\n",
            "qdldl                         0.1.5.post0\n",
            "qtconsole                     5.2.2\n",
            "QtPy                          2.0.1\n",
            "regex                         2019.12.20\n",
            "requests                      2.23.0\n",
            "requests-oauthlib             1.3.1\n",
            "resampy                       0.2.2\n",
            "rpy2                          3.4.5\n",
            "rsa                           4.8\n",
            "scikit-image                  0.18.3\n",
            "scikit-learn                  1.0.2\n",
            "scipy                         1.4.1\n",
            "screen-resolution-extra       0.0.0\n",
            "scs                           3.2.0\n",
            "seaborn                       0.11.2\n",
            "semver                        2.13.0\n",
            "Send2Trash                    1.8.0\n",
            "setuptools                    57.4.0\n",
            "setuptools-git                1.2\n",
            "Shapely                       1.8.1.post1\n",
            "simplegeneric                 0.8.1\n",
            "six                           1.15.0\n",
            "sklearn                       0.0\n",
            "sklearn-pandas                1.8.0\n",
            "smart-open                    5.2.1\n",
            "snowballstemmer               2.2.0\n",
            "sortedcontainers              2.4.0\n",
            "SoundFile                     0.10.3.post1\n",
            "soupsieve                     2.3.1\n",
            "spacy                         2.2.4\n",
            "Sphinx                        1.8.6\n",
            "sphinxcontrib-serializinghtml 1.1.5\n",
            "sphinxcontrib-websupport      1.2.4\n",
            "SQLAlchemy                    1.4.32\n",
            "sqlparse                      0.4.2\n",
            "srsly                         1.0.5\n",
            "statsmodels                   0.10.2\n",
            "sympy                         1.7.1\n",
            "tables                        3.7.0\n",
            "tabulate                      0.8.9\n",
            "tblib                         1.7.0\n",
            "tenacity                      8.0.1\n",
            "tensorboard                   2.8.0\n",
            "tensorboard-data-server       0.6.1\n",
            "tensorboard-plugin-wit        1.8.1\n",
            "tensorflow                    2.8.0\n",
            "tensorflow-datasets           4.0.1\n",
            "tensorflow-estimator          2.8.0\n",
            "tensorflow-gcs-config         2.8.0\n",
            "tensorflow-hub                0.12.0\n",
            "tensorflow-io-gcs-filesystem  0.24.0\n",
            "tensorflow-metadata           1.7.0\n",
            "tensorflow-probability        0.16.0\n",
            "termcolor                     1.1.0\n",
            "terminado                     0.13.3\n",
            "testpath                      0.6.0\n",
            "text-unidecode                1.3\n",
            "textblob                      0.15.3\n",
            "Theano-PyMC                   1.1.2\n",
            "thinc                         7.4.0\n",
            "threadpoolctl                 3.1.0\n",
            "tifffile                      2021.11.2\n",
            "tomli                         2.0.1\n",
            "toolz                         0.11.2\n",
            "torch                         1.10.0+cu111\n",
            "torchaudio                    0.10.0+cu111\n",
            "torchsummary                  1.5.1\n",
            "torchtext                     0.11.0\n",
            "torchvision                   0.11.1+cu111\n",
            "tornado                       5.1.1\n",
            "tqdm                          4.63.0\n",
            "traitlets                     5.1.1\n",
            "tweepy                        3.10.0\n",
            "typeguard                     2.7.1\n",
            "typing-extensions             3.10.0.2\n",
            "tzlocal                       1.5.1\n",
            "uritemplate                   3.0.1\n",
            "urllib3                       1.24.3\n",
            "vega-datasets                 0.9.0\n",
            "wasabi                        0.9.0\n",
            "wcwidth                       0.2.5\n",
            "webencodings                  0.5.1\n",
            "Werkzeug                      1.0.1\n",
            "wheel                         0.37.1\n",
            "widgetsnbextension            3.6.0\n",
            "wordcloud                     1.5.0\n",
            "wrapt                         1.14.0\n",
            "xarray                        0.18.2\n",
            "xgboost                       0.90\n",
            "xkit                          0.0.0\n",
            "xlrd                          1.1.0\n",
            "xlwt                          1.3.0\n",
            "yellowbrick                   1.4\n",
            "zict                          2.1.0\n",
            "zipp                          3.7.0\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Установка библиотеку\n",
        "!pip install camelcase"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "m5BGCkqrEKzP",
        "outputId": "1921b947-5ab9-4c06-9932-831e827127ef"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Collecting camelcase\n",
            "  Downloading camelcase-0.2.tar.gz (1.3 kB)\n",
            "Building wheels for collected packages: camelcase\n",
            "  Building wheel for camelcase (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for camelcase: filename=camelcase-0.2-py3-none-any.whl size=1789 sha256=1f6bf51d550151c5140d5e0b5d9add1aaa986e29cc9c9c5a590675f33d03775c\n",
            "  Stored in directory: /root/.cache/pip/wheels/5c/15/30/92459bdd828bf363082955935e49d5b3f83f8e32af8aad7e34\n",
            "Successfully built camelcase\n",
            "Installing collected packages: camelcase\n",
            "Successfully installed camelcase-0.2\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import camelcase\n",
        "\n",
        "c = camelcase.CamelCase()\n",
        "\n",
        "txt = \"hello world\"\n",
        "\n",
        "print(c.hump(txt))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "p1Wk1sZNEOTt",
        "outputId": "c96e2246-f272-4458-86c9-e1083d3d224f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Hello World\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Дополнительные пакеты можно найти в Индексе пакетов Python (PyPI) - хранилище библиотек для языка программирования Python.\n",
        "\n",
        "https://pypi.org/"
      ],
      "metadata": {
        "id": "sfuPVvnFEVM7"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Библиотека NLTK (Natural Language Processing Toolkit)\n",
        "\n",
        "Пакет библиотек и программ для символьной и статистической обработки естественного языка, написанных на языке программирования Python.\n",
        "\n",
        "У NLTK есть легкие в использовании интерфейсы для многих языковых корпусов, а также библиотеки для обработки текстов для классификации, токенизации, стемминга, разметки, фильтрации и семантических рассуждений."
      ],
      "metadata": {
        "id": "pxx5g4m3EgdB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -U nltk"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UWXypneRFase",
        "outputId": "06c2c775-1b42-4974-8158-99a18a522ab6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.2.5)\n",
            "Collecting nltk\n",
            "  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)\n",
            "\u001b[K     |████████████████████████████████| 1.5 MB 24.9 MB/s \n",
            "\u001b[?25hRequirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from nltk) (4.62.3)\n",
            "Collecting regex>=2021.8.3\n",
            "  Downloading regex-2022.1.18-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (748 kB)\n",
            "\u001b[K     |████████████████████████████████| 748 kB 42.4 MB/s \n",
            "\u001b[?25hRequirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from nltk) (1.1.0)\n",
            "Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from nltk) (7.1.2)\n",
            "Installing collected packages: regex, nltk\n",
            "  Attempting uninstall: regex\n",
            "    Found existing installation: regex 2019.12.20\n",
            "    Uninstalling regex-2019.12.20:\n",
            "      Successfully uninstalled regex-2019.12.20\n",
            "  Attempting uninstall: nltk\n",
            "    Found existing installation: nltk 3.2.5\n",
            "    Uninstalling nltk-3.2.5:\n",
            "      Successfully uninstalled nltk-3.2.5\n",
            "Successfully installed nltk-3.7 regex-2022.1.18\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk"
      ],
      "metadata": {
        "id": "agNU7_mUHYbT"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Для работы могут понадобиться дополнительные данные\n",
        "# punkt - пунктуация\n",
        "nltk.download('punkt')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-yBzskXAH7ns",
        "outputId": "752c4f53-274c-4e41-98fe-0cba164b1455"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Токенизация\n",
        "\n",
        "Токенизация (иногда – сегментация) по предложениям – это процесс разделения письменного языка на предложения-компоненты. Идея выглядит довольно простой. В английском и некоторых других языках мы можем вычленять предложение каждый раз, когда находим определенный знак пунктуации – точку.\n",
        "\n",
        "Но даже в английском эта задача нетривиальна, так как точка используется и в сокращениях. Таблица сокращений может сильно помочь во время обработки текста, чтобы избежать неверной расстановки границ предложений. В большинстве случаев для этого используются библиотеки, так что можете особо не переживать о деталях реализации.\n",
        "\n",
        "Токен  —  это последовательность символов в документе, имеющая значение для анализа. Обычно это отдельные слова, но не всегда. Документ — это коллекция текста. Им может быть твит, книга или что-то еще. \n",
        "\n",
        "Признаки хороших токенов:\n",
        "\n",
        "* Хранятся в перечисляемых структурах (список, генератор) для упрощения анализа в будущем.\n",
        "* Имеют единый регистр для одной цели.\n",
        "* Содержат только буквы и цифры."
      ],
      "metadata": {
        "id": "8gQhvjtuB-Jy"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Токенизация необходима для разделения текста на более мелкие части, с которыми легче работать и из которых можно извлечь полезную информацию."
      ],
      "metadata": {
        "id": "GY-LJkWFJOaS"
      }
    },
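    {
      "cell_type": "markdown",
      "source": [
        "The properties of good tokens listed earlier can be illustrated with a minimal sketch. It assumes that a regular expression over word characters is an acceptable way to split text into word tokens; the helper name `simple_tokens` is hypothetical and is not part of NLTK or of this notebook."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import re\n",
        "\n",
        "def simple_tokens(text):\n",
        "    \"\"\"Return a list of lowercase tokens that contain only letters and digits.\"\"\"\n",
        "    # [^\\W_]+ matches runs of Unicode letters and digits (\\w without the underscore)\n",
        "    return re.findall(r'[^\\W_]+', text.lower())\n",
        "\n",
        "# -> ['мне', '22', 'года', 'я', 'учусь', 'в', 'университете']\n",
        "simple_tokens('Мне 22 года. Я учусь в университете!')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },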
    {
      "cell_type": "markdown",
      "source": [
        "### Токенизация по предложениям"
      ],
      "metadata": {
        "id": "yOobKOkTI_R8"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Токенизация с помощью регулярных выражений"
      ],
      "metadata": {
        "id": "gVOpG4_gITHK"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import re\n",
        "\n",
        "text = '''Привет, меня зовут Андрей. \n",
        "Мне 22 года. Я учусь в университете.\n",
        "Т.к. я учусь - я студент.'''\n",
        "\n",
        "regexp = r'\\.\\s'\n",
        "\n",
        "sentences = re.split(regexp, text)\n",
        "sentences"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hO1FsyCyIRVE",
        "outputId": "bed7da57-2453-46db-f3bd-c53fcaaa9542"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['Привет, меня зовут Андрей',\n",
              " '\\nМне 22 года',\n",
              " 'Я учусь в университете',\n",
              " 'Т.к',\n",
              " 'я учусь - я студент.']"
            ]
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
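    {
      "cell_type": "markdown",
      "source": [
        "The split above breaks the abbreviation `Т.к.` in two and strips the final periods. A table of abbreviations, as mentioned earlier, can guard against the first problem. The sketch below is only an illustration under stated assumptions: `ABBREVIATIONS` is a tiny hand-picked set and `split_sentences` is a hypothetical helper, not part of any library. It also fails when an abbreviation really does end a sentence.\n",
        "\n",
        "```python\n",
        "import re\n",
        "\n",
        "# Hypothetical, hand-picked table; a realistic one would be far larger\n",
        "ABBREVIATIONS = {'т.к.', 'т.е.', 'т.д.', 'т.п.'}\n",
        "\n",
        "def split_sentences(text):\n",
        "    sentences, start = [], 0\n",
        "    for match in re.finditer(r'\\.\\s+', text):\n",
        "        candidate = text[start:match.start() + 1]  # keep the period\n",
        "        words = candidate.split()\n",
        "        # Skip this period if the word it ends is a known abbreviation\n",
        "        if words and words[-1].lower() in ABBREVIATIONS:\n",
        "            continue\n",
        "        sentences.append(candidate.strip())\n",
        "        start = match.end()\n",
        "    tail = text[start:].strip()  # whatever follows the last boundary\n",
        "    if tail:\n",
        "        sentences.append(tail)\n",
        "    return sentences\n",
        "\n",
        "text = 'Мне 22 года. Я учусь в университете. Т.к. я учусь - я студент.'\n",
        "print(split_sentences(text))\n",
        "```"
      ],
      "metadata": {}
    },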
    {
      "cell_type": "markdown",
      "source": [
        "#### Tokenization with NLTK"
      ],
      "metadata": {
        "id": "Ksa0pCieIW1u"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PdsRjn-NB48-",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "51f488d4-d894-4979-d411-0c659e3875ef"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['Привет, меня зовут Андрей.',\n",
              " 'Мне 22 года.',\n",
              " 'Я учусь в университете.',\n",
              " 'Т.к. я учусь - я студент.']"
            ]
          },
          "metadata": {},
          "execution_count": 9
        }
      ],
      "source": [
        "text = '''Привет, меня зовут Андрей. \n",
        "Мне 22 года. \n",
        "Я учусь в университете. \n",
        "Т.к. я учусь - я студент.'''\n",
        "\n",
        "sentences = nltk.tokenize.sent_tokenize(text, language='russian')\n",
        "sentences"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Word tokenization"
      ],
      "metadata": {
        "id": "C0gcbA5xI8Tt"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Tokenization with regular expressions"
      ],
      "metadata": {
        "id": "f_oH6bqEJHJy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text = '''Привет, меня зовут Андрей. \n",
        "Мне 22 года. Я учусь в университете. \n",
        "Т.к. я учусь - я студент.'''\n",
        "\n",
        "regexp = r'\\w+'\n",
        "\n",
        "tokens = re.findall(regexp, text)\n",
        "tokens"
      ],
      "metadata": {
        "id": "ISXcWO_SB-fS",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "b22e9e03-df73-41a8-eb71-870eb43abfcb"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['Привет',\n",
              " 'меня',\n",
              " 'зовут',\n",
              " 'Андрей',\n",
              " 'Мне',\n",
              " '22',\n",
              " 'года',\n",
              " 'Я',\n",
              " 'учусь',\n",
              " 'в',\n",
              " 'университете',\n",
              " 'Т',\n",
              " 'к',\n",
              " 'я',\n",
              " 'учусь',\n",
              " 'я',\n",
              " 'студент']"
            ]
          },
          "metadata": {},
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Tokenization with NLTK"
      ],
      "metadata": {
        "id": "brL6H9i7LAJX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text = '''Привет, меня зовут Андрей. \n",
        "Мне 22 года. Я учусь в университете. \n",
        "Т.к. я учусь - я студент.'''\n",
        "\n",
        "tokens = nltk.tokenize.word_tokenize(text, language='russian')\n",
        "tokens"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CyiE0H1TK_Zh",
        "outputId": "e44dc99d-0077-4289-9edb-509fc6392d4d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['Привет',\n",
              " ',',\n",
              " 'меня',\n",
              " 'зовут',\n",
              " 'Андрей',\n",
              " '.',\n",
              " 'Мне',\n",
              " '22',\n",
              " 'года',\n",
              " '.',\n",
              " 'Я',\n",
              " 'учусь',\n",
              " 'в',\n",
              " 'университете',\n",
              " '.',\n",
              " 'Т.к.',\n",
              " 'я',\n",
              " 'учусь',\n",
              " '-',\n",
              " 'я',\n",
              " 'студент',\n",
              " '.']"
            ]
          },
          "metadata": {},
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Stop-word removal\n",
        "\n",
        "Stop words are words that are dropped from the text before or after processing. They typically include articles, interjections, conjunctions, and other words that carry little meaning on their own. There is no universal stop-word list: the right one depends on the task at hand."
      ],
      "metadata": {
        "id": "38w_j7qJQGNZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "nltk.download('stopwords')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_ypuVPpNQFph",
        "outputId": "35576acd-7287-4d5b-940b-cd5786c67610"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
            "[nltk_data]   Unzipping corpora/stopwords.zip.\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "russian_stop_words = nltk.corpus.stopwords.words('russian')\n",
        "russian_stop_words"
      ],
      "metadata": {
        "id": "eIXjgxOHQfcE",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "ceed4b42-f093-4cf3-cda3-89e6c79f3ccd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['и',\n",
              " 'в',\n",
              " 'во',\n",
              " 'не',\n",
              " 'что',\n",
              " 'он',\n",
              " 'на',\n",
              " 'я',\n",
              " 'с',\n",
              " 'со',\n",
              " 'как',\n",
              " 'а',\n",
              " 'то',\n",
              " 'все',\n",
              " 'она',\n",
              " 'так',\n",
              " 'его',\n",
              " 'но',\n",
              " 'да',\n",
              " 'ты',\n",
              " 'к',\n",
              " 'у',\n",
              " 'же',\n",
              " 'вы',\n",
              " 'за',\n",
              " 'бы',\n",
              " 'по',\n",
              " 'только',\n",
              " 'ее',\n",
              " 'мне',\n",
              " 'было',\n",
              " 'вот',\n",
              " 'от',\n",
              " 'меня',\n",
              " 'еще',\n",
              " 'нет',\n",
              " 'о',\n",
              " 'из',\n",
              " 'ему',\n",
              " 'теперь',\n",
              " 'когда',\n",
              " 'даже',\n",
              " 'ну',\n",
              " 'вдруг',\n",
              " 'ли',\n",
              " 'если',\n",
              " 'уже',\n",
              " 'или',\n",
              " 'ни',\n",
              " 'быть',\n",
              " 'был',\n",
              " 'него',\n",
              " 'до',\n",
              " 'вас',\n",
              " 'нибудь',\n",
              " 'опять',\n",
              " 'уж',\n",
              " 'вам',\n",
              " 'ведь',\n",
              " 'там',\n",
              " 'потом',\n",
              " 'себя',\n",
              " 'ничего',\n",
              " 'ей',\n",
              " 'может',\n",
              " 'они',\n",
              " 'тут',\n",
              " 'где',\n",
              " 'есть',\n",
              " 'надо',\n",
              " 'ней',\n",
              " 'для',\n",
              " 'мы',\n",
              " 'тебя',\n",
              " 'их',\n",
              " 'чем',\n",
              " 'была',\n",
              " 'сам',\n",
              " 'чтоб',\n",
              " 'без',\n",
              " 'будто',\n",
              " 'чего',\n",
              " 'раз',\n",
              " 'тоже',\n",
              " 'себе',\n",
              " 'под',\n",
              " 'будет',\n",
              " 'ж',\n",
              " 'тогда',\n",
              " 'кто',\n",
              " 'этот',\n",
              " 'того',\n",
              " 'потому',\n",
              " 'этого',\n",
              " 'какой',\n",
              " 'совсем',\n",
              " 'ним',\n",
              " 'здесь',\n",
              " 'этом',\n",
              " 'один',\n",
              " 'почти',\n",
              " 'мой',\n",
              " 'тем',\n",
              " 'чтобы',\n",
              " 'нее',\n",
              " 'сейчас',\n",
              " 'были',\n",
              " 'куда',\n",
              " 'зачем',\n",
              " 'всех',\n",
              " 'никогда',\n",
              " 'можно',\n",
              " 'при',\n",
              " 'наконец',\n",
              " 'два',\n",
              " 'об',\n",
              " 'другой',\n",
              " 'хоть',\n",
              " 'после',\n",
              " 'над',\n",
              " 'больше',\n",
              " 'тот',\n",
              " 'через',\n",
              " 'эти',\n",
              " 'нас',\n",
              " 'про',\n",
              " 'всего',\n",
              " 'них',\n",
              " 'какая',\n",
              " 'много',\n",
              " 'разве',\n",
              " 'три',\n",
              " 'эту',\n",
              " 'моя',\n",
              " 'впрочем',\n",
              " 'хорошо',\n",
              " 'свою',\n",
              " 'этой',\n",
              " 'перед',\n",
              " 'иногда',\n",
              " 'лучше',\n",
              " 'чуть',\n",
              " 'том',\n",
              " 'нельзя',\n",
              " 'такой',\n",
              " 'им',\n",
              " 'более',\n",
              " 'всегда',\n",
              " 'конечно',\n",
              " 'всю',\n",
              " 'между']"
            ]
          },
          "metadata": {},
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "def remove_stop_words(tokens: list, stop_words: list) -> list:\n",
        "  clean_tokens = []\n",
        "  for token in tokens:\n",
        "    if token not in stop_words:\n",
        "      clean_tokens.append(token)\n",
        "  return clean_tokens"
      ],
      "metadata": {
        "id": "rIx1uQ0ZQ_Kr"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "text = '''Привет, меня зовут Андрей. \n",
        "Мне 22 года. Я учусь в университете. \n",
        "Т.к. я учусь - я студент.'''.lower()\n",
        "\n",
        "tokens = nltk.tokenize.word_tokenize(text, language='russian')\n",
        "\n",
        "clean_tokens = remove_stop_words(tokens, russian_stop_words)\n",
        "clean_tokens"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mxQmwAliQ16a",
        "outputId": "2a24e5b0-ea16-41ca-9dea-887df07d780c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['привет',\n",
              " ',',\n",
              " 'зовут',\n",
              " 'андрей',\n",
              " '.',\n",
              " '22',\n",
              " 'года',\n",
              " '.',\n",
              " 'учусь',\n",
              " 'университете',\n",
              " '.',\n",
              " 'т.к.',\n",
              " 'учусь',\n",
              " '-',\n",
              " 'студент',\n",
              " '.']"
            ]
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    },
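    {
      "cell_type": "markdown",
      "source": [
        "Checking `token not in stop_words` against a list scans the list for every token. On a large text it may be worth converting the stop words to a set first. The variant below is a sketch, not a replacement for the function above: `remove_stop_words_set` is a hypothetical name, and the short stop list stands in for `russian_stop_words`.\n",
        "\n",
        "```python\n",
        "# Hypothetical mini stop list; in this notebook it would be russian_stop_words\n",
        "stop_words = ['я', 'в', 'меня', 'мне']\n",
        "\n",
        "def remove_stop_words_set(tokens, stop_words):\n",
        "    stop_set = set(stop_words)  # built once; membership checks are O(1) on average\n",
        "    return [token for token in tokens if token not in stop_set]\n",
        "\n",
        "tokens = ['я', 'учусь', 'в', 'университете']\n",
        "print(remove_stop_words_set(tokens, stop_words))\n",
        "```"
      ],
      "metadata": {}
    },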
    {
      "cell_type": "markdown",
      "source": [
        "## Lemmatization and stemming\n",
        "\n",
        "Texts usually contain different grammatical forms of the same word, as well as words that share a root. Lemmatization and stemming both aim to reduce these word forms to a single shared form: a stem in the case of stemming, a dictionary form in the case of lemmatization.\n",
        "\n",
        "* Stemming is a crude heuristic process that chops the \"extra\" off the root of a word, which often loses derivational suffixes.\n",
        "\n",
        "* Lemmatization is a finer process that uses a dictionary and morphological analysis to reduce a word to its canonical form, the lemma.\n",
        "\n",
        "The difference is that a stemmer works without knowledge of context and therefore cannot distinguish words whose meaning depends on the part of speech. Stemmers have their own advantages, though: they are easier to adopt and they run faster. Their lower accuracy may also not matter in some applications."
      ],
      "metadata": {
        "id": "2criw7A-Pc8N"
      }
    },
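    {
      "cell_type": "markdown",
      "source": [
        "To make the \"crude heuristic\" concrete, here is a toy suffix stripper. It is deliberately naive and is not the Snowball algorithm: the suffix list and the minimum stem length of three characters are arbitrary assumptions chosen for this illustration.\n",
        "\n",
        "```python\n",
        "# Hypothetical, tiny suffix list; a real stemmer uses many ordered rules\n",
        "SUFFIXES = ('ая', 'ий', 'ый', 'ое', 'ые', 'ть', 'у')\n",
        "\n",
        "def toy_stem(word):\n",
        "    # Try longer suffixes first and keep at least three characters of stem\n",
        "    for suffix in sorted(SUFFIXES, key=len, reverse=True):\n",
        "        if word.endswith(suffix) and len(word) - len(suffix) >= 3:\n",
        "            return word[:-len(suffix)]\n",
        "    return word\n",
        "\n",
        "for word in ['хороший', 'хорошая', 'хотеть', 'хочу']:\n",
        "    print(word, '->', toy_stem(word))\n",
        "```\n",
        "\n",
        "Both forms of the adjective collapse to the same stem, while the two forms of the verb do not, because cutting endings cannot undo a change inside the root."
      ],
      "metadata": {}
    },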
    {
      "cell_type": "markdown",
      "source": [
        "### Stemming\n",
        "\n",
        "Russian has rich morphology. The words *хороший* and *хорошая* have the same meaning but different forms, as in *хорошая мебель* (\"good furniture\") and *хороший стул* (\"a good chair\").\n",
        "\n",
        "When working with text, it therefore makes sense to reduce words to a common, normalized form."
      ],
      "metadata": {
        "id": "mouy-q4odnY0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "stemmer = nltk.stem.SnowballStemmer(language='russian')"
      ],
      "metadata": {
        "id": "96eYX4W0K_XW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "stemmer.stem('хороший')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "7DNNha2AK_U2",
        "outputId": "bcf7cc11-3708-4e0c-a751-b4992670a0aa"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'хорош'"
            ]
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "stemmer.stem('хорошая')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "azYtD0sgK_St",
        "outputId": "38104b72-477d-48eb-e442-cc217e9319b1"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'хорош'"
            ]
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Problems can arise with words whose stem changes significantly from one form to another:"
      ],
      "metadata": {
        "id": "Dqxtu53LeHEx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "stemmer.stem('хотеть')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "JOZ1AWkxK_QC",
        "outputId": "be2d21a1-a320-442e-9997-0acb1c8cfdfb"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'хотет'"
            ]
          },
          "metadata": {},
          "execution_count": 19
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "stemmer.stem('хочу')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "kePu508CK_Ny",
        "outputId": "0dcf8584-d394-4a1b-bc4c-687fb4eedf69"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'хоч'"
            ]
          },
          "metadata": {},
          "execution_count": 20
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "*Хотеть* (\"to want\") and *хочу* (\"I want\") are grammatical forms of the same word, but the stemmer cuts off endings according to its algorithm and produces two different stems. In such cases another method, lemmatization, may be a better fit."
      ],
      "metadata": {
        "id": "ZsuUcyduezMi"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Lemmatization\n",
        "\n",
        "A word can be analyzed morphologically to find its base form. For example, *хочу*, *хотят*, and *хотели* all have the base form *хотеть*.\n",
        "\n",
        "`pymorphy2`, a morphological analyzer for Russian and Ukrainian, can be used to reduce words to their base form."
      ],
      "metadata": {
        "id": "YO6LFIjOe72x"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install pymorphy2\n",
        "!pip install -U pymorphy2-dicts-ru"
      ],
      "metadata": {
        "id": "S7o2jejme7gB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import pymorphy2"
      ],
      "metadata": {
        "id": "cRzkjnlSK_LK"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "morph = pymorphy2.MorphAnalyzer()"
      ],
      "metadata": {
        "id": "7ePK0Fl6K_Iw"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "morph.parse('хочу')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7-VbPgvAK_GR",
        "outputId": "3a7ab911-2bf1-4924-fcc2-aad3ca2c7e51"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[Parse(word='хочу', tag=OpencorporaTag('VERB,impf,tran sing,1per,pres,indc'), normal_form='хотеть', score=1.0, methods_stack=((DictionaryAnalyzer(), 'хочу', 3136, 1),))]"
            ]
          },
          "metadata": {},
          "execution_count": 24
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "morph.parse('хочу')[0].normal_form"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "ZzBWGsxgK_EE",
        "outputId": "2b883ef1-3bf6-4916-a1a4-28599d1c70b5"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'хотеть'"
            ]
          },
          "metadata": {},
          "execution_count": 25
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "morph.parse('хотят')[0].normal_form"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "0PsP0aVmK_Bi",
        "outputId": "f7c368c4-4316-4c87-fd15-4ae0fb23f617"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'хотеть'"
            ]
          },
          "metadata": {},
          "execution_count": 26
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "morph.parse('хотели')[0].normal_form"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "wbcesZgwK-_M",
        "outputId": "7c435567-747a-46d1-ad2d-921227e2b820"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'хотеть'"
            ]
          },
          "metadata": {},
          "execution_count": 27
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Word frequencies\n",
        "\n",
        "Frequency analysis is one of the comparatively simple methods of natural language processing (NLP). It produces a list of the words that occur most often in a text, which also gives an idea of the text's topic and main concepts."
      ],
      "metadata": {
        "id": "_6UcXrMvXQsg"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Frequency analysis can be roughly divided into several stages:\n",
        "\n",
        "1. Loading and inspecting the data\n",
        "2. Cleaning and preprocessing the text\n",
        "3. Removing stop words\n",
        "4. Reducing words to their base form\n",
        "5. Counting how often each word occurs in the text"
      ],
      "metadata": {
        "id": "dMNZyEm2ZwLg"
      }
    },
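    {
      "cell_type": "markdown",
      "source": [
        "Before bringing in NLTK and a morphological analyzer, the stages above can be sketched with the standard library alone. This is a simplified illustration: the text is an inline string instead of a loaded file, stage 4 (reduction to the base form) is skipped, and `count_words` is a hypothetical helper name.\n",
        "\n",
        "```python\n",
        "import re\n",
        "from collections import Counter\n",
        "\n",
        "def count_words(text, stop_words):\n",
        "    # Stage 2: lower-case the text and keep only word-like tokens\n",
        "    tokens = re.findall(r'\\w+', text.lower())\n",
        "    # Stage 3: drop stop words; stage 4 (lemmatization) is skipped here\n",
        "    kept = [token for token in tokens if token not in stop_words]\n",
        "    # Stage 5: count how often each remaining token occurs\n",
        "    return Counter(kept)\n",
        "\n",
        "sample = 'Я учусь. Я студент, я учусь!'\n",
        "print(count_words(sample, {'я'}).most_common(1))\n",
        "```"
      ],
      "metadata": {}
    },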
    {
      "cell_type": "code",
      "source": [
        "# Download the text of F. M. Dostoevsky's \"The Idiot\"\n",
        "!wget -O idiot.txt https://github.com/d0rj/RusLit/raw/main/prose/Dostoevsky/%D0%98%D0%B4%D0%B8%D0%BE%D1%82.txt"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wUk3qnErbEKX",
        "outputId": "2266f5ac-eba4-494a-c261-08088ed76809"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-03-25 12:55:00--  https://github.com/d0rj/RusLit/raw/main/prose/Dostoevsky/%D0%98%D0%B4%D0%B8%D0%BE%D1%82.txt\n",
            "Resolving github.com (github.com)... 140.82.113.3\n",
            "Connecting to github.com (github.com)|140.82.113.3|:443... connected.\n",
            "HTTP request sent, awaiting response... 302 Found\n",
            "Location: https://raw.githubusercontent.com/d0rj/RusLit/main/prose/Dostoevsky/%D0%98%D0%B4%D0%B8%D0%BE%D1%82.txt [following]\n",
            "--2022-03-25 12:55:01--  https://raw.githubusercontent.com/d0rj/RusLit/main/prose/Dostoevsky/%D0%98%D0%B4%D0%B8%D0%BE%D1%82.txt\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 2362687 (2.3M) [text/plain]\n",
            "Saving to: ‘idiot.txt’\n",
            "\n",
            "idiot.txt           100%[===================>]   2.25M  --.-KB/s    in 0.01s   \n",
            "\n",
            "2022-03-25 12:55:01 (217 MB/s) - ‘idiot.txt’ saved [2362687/2362687]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "def load_data(path_to_file: str) -> str:\n",
        "  with open(path_to_file, 'r', encoding='utf-8') as df:\n",
        "    text = df.read()\n",
        "    text = text.lower()\n",
        "    text = text.replace('\\n', ' ')\n",
        "    return text"
      ],
      "metadata": {
        "id": "hoe8v8DZK-8r"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "text = load_data('idiot.txt')\n",
        "text[:100]"
      ],
      "metadata": {
        "id": "jkoKXjZCK-6V",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "1e7ef418-3e70-40ca-c569-c4c5ceb1d793"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "' часть первая   i  в конце ноября, в оттепель, часов в девять утра, поезд петербургско-варшавской же'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "def extract_tokens(text: str) -> list:\n",
        "  return nltk.tokenize.word_tokenize(text, language='russian')"
      ],
      "metadata": {
        "id": "CqJCk3V5K-35"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "def extract_sentences(text: str) -> list:\n",
        "  return nltk.tokenize.sent_tokenize(text, language='russian')"
      ],
      "metadata": {
        "id": "eSZnlZEFcvAi"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "tokens = extract_tokens(text)\n",
        "print(f'A few tokens: {tokens[:10]}')\n",
        "print(f'Total number of tokens: {len(tokens)}\\n')\n",
        "\n",
        "sentences = extract_sentences(text)\n",
        "print(f'Two sentences: {sentences[:2]}')\n",
        "print(f'Total number of sentences: {len(sentences)}')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 304
        },
        "id": "QCBHEhR_c8q7",
        "outputId": "01d63ad4-63e1-4a81-ba9c-80299979ce07"
      },
      "execution_count": null,
      "outputs": []
    },
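    {
      "cell_type": "code",
      "source": [
        "import string\n",
        "\n",
        "def count_frequencies(tokens: list) -> dict:\n",
        "  # Punctuation and other symbols that should not be counted as words\n",
        "  special_symbols = '<>«»-–+\\\\' + string.punctuation\n",
        "  \n",
        "  # Lemmatize every token that is neither punctuation nor a stop word\n",
        "  lemmas = [morph.parse(token)[0].normal_form for token in tokens \n",
        "            if token not in special_symbols and token not in russian_stop_words]\n",
        "  \n",
        "  # FreqDist maps each lemma to its number of occurrences\n",
        "  return nltk.probability.FreqDist(lemmas)"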
      ],
      "metadata": {
        "id": "CXFV2iRpK-1E"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "word_frequencies = count_frequencies(tokens)\n",
        "word_frequencies.most_common(5)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zWcWPJ9Ogaiz",
        "outputId": "eecd1657-705d-48e8-8061-dd0873d0a257"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('князь', 2167),\n",
              " ('это', 1720),\n",
              " ('знать', 849),\n",
              " ('всё', 795),\n",
              " ('говорить', 790)]"
            ]
          },
          "metadata": {},
          "execution_count": 46
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "word_frequencies.plot(30, cumulative=False)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 355
        },
        "id": "2CSIkOZQgaeg",
        "outputId": "a7a74b40-e99a-450a-b95b-c1a79bdcb0ac"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAFBCAYAAACGk4NZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOydeXwdVfXAvydrk7ZJujctpaWlZSvQkiCoiOwCIuCGICiu/PgJKvpTARc25SciuIACFQFBEFwQpf2BUBBaQLaUrrRAS/c2bemeNmnW8/vj3pdMJ/OSeUleXpKe7+cznzdz58yZ++bNmzP3nnPPFVXFMAzDMNoiK9MVMAzDMHo+ZiwMwzCMdjFjYRiGYbSLGQvDMAyjXcxYGIZhGO1ixsIwDMNol7QZCxEZIyLPichiEXlTRL7py38uIm+JyAIReUxESnz5OBGpEZF5frkroKtMRBaKyDIRuU1EJF31NgzDMFoj6RpnISKlQKmqviEiA4E5wLnAfsC/VbVBRH4GoKpXisg4YIaqTo7Q9RrwDeBV4AngNlV9Mi0VNwzDMFqRtpaFqlaq6ht+vQpYAoxW1adVtcGLvYIzHknxRqdIVV9RZ9kewBkdwzAMo5vI6Y6T+FbDVFzLIMiXgD8Htg8QkbnATuCHqvoCMBpYG5BZ68vaZOjQoTpu3LgO1bempoaCgoIukzOdptN0ms6eqDPMnDlzNqvqsKh9aTcWIjIAeBS4QlV3Bsp/ADQAD/miSmB/Vd0iImXAP0TksBTPdQlwCUBpaSnTpk3rUJ2rq6spLCzsMjnTaTpNp+nsiTrDlJeXr0q6U1XTtgC5wFPAt0PlXwBeBgrbOPZ5oBwoBd4KlF8ATGvv3GVlZdpRKioqulTOdJpO02k6e6LOMECFJnmmpjMaSoB7gCWq+otA+enA94CzVbU6UD5MRLL9+nhgIrBcVSuBnSJyrNf5eeCf6aq3YRiG0Zp0dkN9EPgcsFBE5vmy7wO3AfnATB8B+4qqXgocD9wgIvVAE3Cpqm71x30N+ANQADzpF8MwDKObSJuxUNUXgajxEE8kkX8U59uI2lcBtAqpNQzDMLoHG8FtGIZhtIsZC8MwDKNdzFiEaGhsYkdtU6arYRiG0aMwYxHgpWWbOfy6p/nt6zsyXRXDMIweRbeM4O4tjB1SSE19I+9saUJVsXyFhmEYDmtZBBhdUsCwgflU1Skrt1S3f4BhGMY+ghmLACLC1DElAMxdvS3DtTEMw+g5mLEIMXX/QQDMXb09wzUxDMPoOZixCDF1f9+yWGMtC8MwjARmLEIcsV8xWcCSyipq6hozXR3DMIwegRmLEIV5OYwtyaGxSVm4zkJoDcMwwIxFJBMH5wLm5DYMw0hgxiKCSUPyAHNyG4ZhJDBjEcGkIa5l8cbqbYkJlwzDMPZpzFhEUDogm+KCXDZV1VK5Y0+mq2MYhpFxzFhEkCXClObBedYVZRiGYcYiCc3jLczJbRiGYcYiGc0juddYy8IwDCNtxkJExojIcyKyWETeFJFv+vLBIjJTRJb6z0G+XETkNhFZJiILROSogK6LvfxSEbk4XXUOMmU/17JYuG4HdQ02v4VhGPs26WxZNAD/o6qHAscCl4nIocBVwLOqOhF41m8DnAFM9MslwJ3gjAtwLXAM8D7g2oSBSSfFhblMGNafuoYmllTuTPfpDMMwejRpMxaqWqmqb/j1KmAJMBo4B7jfi90PnOvXzwEeUMcrQImIlAIfAWaq6lZV3QbMBE5PV72DtCQVNL+FYRj7Nt3isxCRccBU4FVghKpW+l0bgBF+fTSwJnDYWl+WrDzttCQVNL+FYRj7NpLuQWciMgCYBdyoqn8Xke2qWhLYv01VB4nIDOAmVX3Rlz8LXAmcAPRT1Z/48h8BNap6S8S5LsF1YVFaWlo2ffr0DtW5urqawsJCVm6v539mbmFE/2zuOHNYUrlUdH
aVnOk0nabTdHZWZ5jy8vI5qloeuVNV07YAucBTwLcDZW8DpX69FHjbr08DLgjLARcA0wLle8klW8rKyrSjVFRUqKpqfUOjHvKjJ3XslTP0vao9SeVS0dlVcqbTdJpO09lZnWGACk3yTE1nNJQA9wBLVPUXgV2PA4mIpouBfwbKP++joo4FdqjrrnoKOE1EBnnH9mm+LO3kZGdxxH7FAMyzwXmGYezDpNNn8UHgc8BJIjLPL2cCNwGnishS4BS/DfAEsBxYBtwNfA1AVbcCPwZe98sNvqxbaBlvYU5uwzD2XXLSpVid70GS7D45Ql6By5Louhe4t+tqF5+plvbDMAzDRnC3xxQfETV/zXYamywDrWEY+yZmLNph+MB+7DeogN11jSzdVJXp6hiGYWQEMxYxaBmcZ11RhmHsm5ixiEGL38Kc3IZh7JuYsYhBS7pya1kYhrFvYsYiBoeOKiIvO4ulm3axo6Y+09UxDMPodsxYxCA/J5vDRhcBsGCttS4Mw9j3MGMRk6ljzMltGMa+ixmLmNg0q4Zh7MuYsYhJMF25pjlTr2EYRk/DjEVMRpcUMGxgPtur61m5pTrT1TEMw+hWzFjERERsvIVhGPssZixSwEZyG4axr2LGIgVa/BbWsjAMY9/CjEUKHLFfMVkCSyqrqKlrzHR1DMMwug0zFilQmJfDwSOLaGxSFq7bkenqGIZhdBtmLFLExlsYhrEvYsYiRczJbRjGvkjajIWI3Csim0RkUaDsz4H5uFeKyDxfPk5EagL77gocUyYiC0VkmYjcJiLJpmrtFhItizdWb7PBeYZh7DOkbQ5u4A/Ab4AHEgWq+pnEuojcCgQ7/t9V1SkReu4Evgq8CjwBnA48mYb6xuKAIf0pLshlU1UtW2qaMlUNwzCMbiVtLQtVnQ1sjdrnWwfnAQ+3pUNESoEiVX1F3Wv8A8C5XV3XVMjKEqb4wXnvbLF05YZh7BtkymfxIWCjqi4NlB0gInNFZJaIfMiXjQbWBmTW+rKMkuiKemdrXYZrYhiG0T1IOvvdRWQcMENVJ4fK7wSWqeqtfjsfGKCqW0SkDPgHcBgwCbhJVU/xch8CrlTVs5Kc7xLgEoDS0tKy6dOnd6je1dXVFBYWJt0/d0MtP3lhGxMHZXPTKcO6RGeqcqbTdJpO09lZnWHKy8vnqGp55E5VTdsCjAMWhcpygI3Afm0c9zxQDpQCbwXKLwCmxTl3WVmZdpSKioo292/fXadjr5yhB149Q2vrG7tEZ6pyptN0mk7T2VmdYYAKTfJMzUQ31CneADR3L4nIMBHJ9uvjgYnAclWtBHaKyLHez/F54J8ZqPNeFBfmMmFYf+qbYEnlzkxXxzAMI+2kM3T2YeBl4CARWSsiX/a7zqe1Y/t4YIEPpf0bcKmqJpzjXwN+DywD3iWDkVBBDi5106yu2Lw7wzUxDMNIP2kLnVXVC5KUfyGi7FHg0STyFcDkqH2ZZFRxPwAqd+zJcE0MwzDSj43g7iAjiwsAqNxRk+GaGIZhpB8zFh0k0bJYv91aFoZh9H3MWHSQ0hLXstiw01oWhmH0fcxYdJBmn4W1LAzD2AcwY9FBhgzIJ1tgy+469tTbREiGYfRtzFh0kOwsYXCBu3wbLCLKMIw+jhmLTjC0MBuw8FnDMPo+Ziw6wZCChLEwJ7dhGH0bMxadYIi1LAzD2EcwY9EJhnqfxfrt1rIwDKNvY8aiEyRaFubgNgyjr2PGohMkHNzrzVgYhtHHMWPRCYb4bihzcBuG0dcxY9EJivKzyMvOYnt1PTV1NjDPMIy+ixmLTpAlwsjmVOXWujAMo+9ixqKTjLR5LQzD2AcwY9FJWlKVW8vCMIy+ixmLTpJIVW4tC8Mw+jLpnIP7XhHZJCKLAmXXicg6EZnnlzMD+64WkWUi8raIfCRQfrovWyYiV6Wrvh3Fplc1DGNfIJ0tiz8Ap0eU/1JVp/jlCQARORQ4HzjMH3OHiGSLSDbwW+AM4FDgAi/bY7DpVQ
3D2BfISZdiVZ0tIuNiip8DPKKqtcAKEVkGvM/vW6aqywFE5BEvu7iLq9thSm0SJMMw9gEy4bO4XEQW+G6qQb5sNLAmILPWlyUr7zGM8j6L9dayMAyjDyOqmj7lrmUxQ1Un++0RwGZAgR8Dpar6JRH5DfCKqj7o5e4BnvRqTlfVr/jyzwHHqOrlSc53CXAJQGlpadn06dM7VO/q6moKCwtjyRUUFPDZv2+krgkePHc4BbnR9jcVnXHkTKfpNJ2ms7M6w5SXl89R1fLInaqatgUYByxqbx9wNXB1YN9TwPv98lSgfC+5tpaysjLtKBUVFSnJffjmf+vYK2fo0o07u0xnV8qaTtNpOk1nHIAKTfJM7dZuKBEpDWx+HEhESj0OnC8i+SJyADAReA14HZgoIgeISB7OCf54d9Y5DqXeyb3e/BaGYfRR0ubgFpGHgROAoSKyFrgWOEFEpuC6oVYC/wWgqm+KyF9wjusG4DJVbfR6Lse1NLKBe1X1zXTVuaOUlljKD8Mw+jbpjIa6IKL4njbkbwRujCh/AniiC6vW5YwqtoF5hmH0bWwEdxcw0sJnDcPo45ix6AJG+W4oC581DKOvYsaiCyi1bijDMPo4Ziy6gITPwubiNgyjr2LGogsoKsihIDebXbUN7NxTn+nqGIZhdDlmLLoAEWkJnzUnt2EYfZCUjYWIDBKRI9JRmd5MoivKnNyGYfRFYhkLEXleRIpEZDDwBnC3iPwivVXrXSSyz5rfwjCMvkjclkWxqu4EPgE8oKrHAKekr1q9j5ZU5dayMAyj7xHXWOT4vE7nATPSWJ9eS2lzqnJrWRiG0feIayyux+VnWqaqr4vIeGBp+qrV+2huWZjPwjCMPkjc3FCVqtrs1FbV5eaz2JvEJEg2MM8wjL5I3JbF7THL9lmC+aE0jRNKGYZhZII2WxYi8n7gA8AwEfl2YFcRLmW44Snql8uA/Bx21Tawo6aeksK8TFfJMAyjy2ivZZEHDMAZlYGBZSfwqfRWrfeR8FvYJEiGYfQ12mxZqOosYJaI/EFVV3VTnXotI4v7sXTTLjbsrOHQUUWZro5hGEaXEdfBnS8iv8PNm918jKqelI5K9VZG2fSqhmH0UeIai78CdwG/BxrTV53ejU2vahhGXyVuNFSDqt6pqq+p6pzE0tYBInKviGwSkUWBsp+LyFsiskBEHhOREl8+TkRqRGSeX+4KHFMmIgtFZJmI3CYi0qFv2g00T69qLQvDMPoYcY3FdBH5moiUisjgxNLOMX8ATg+VzQQm+zEb7wBXB/a9q6pT/HJpoPxO4KvARL+EdfYYmsNnbayFYRh9jLjdUBf7z+8GyhQYn+wAVZ0tIuNCZU8HNl+hnYgqn2KkSFVf8dsPAOcCT8asd7cyyrqhDMPoo0g6B5B5YzFDVSdH7JsO/FlVH/Ryb+JaGzuBH6rqCyJSDtykqqf4Yz4EXKmqZyU53yXAJQClpaVl06dP71C9q6urKSwsTFmupqGJix7bRF4W/OkTIwj2mHVUZzrqaTpNp+k0nVGUl5fPUdXyyJ2q2u4CfD5qiXHcOGBRRPkPgMdoMVb5wBC/XgaswQ38KweeCRz3IZzxabfOZWVl2lEqKio6LHf4tf/SsVfO0C27artMZ2dlTafpNJ2mMw5AhSZ5psbthjo6sN4POBk3r8UDMY9vRkS+AJwFnOwrh6rWArV+fY6IvAtMAtYB+wUO38+X9VhKiwvYuaeK9dtrGNzfRnEbhtE3iGUsVPXrwW0fxfRIqicTkdOB7wEfVtXqQPkwYKuqNvqMthOB5aq6VUR2isixwKu4Fk2PzklVWtKPtzdWUbljD5NHF2e6OoZhGF1C3JZFmN3AAW0JiMjDwAnAUBFZC1yLi37KB2b6/vxX1EU+HQ/cICL1QBNwqapu9aq+housKsA5tnukcztBaSJ81pzchmH0IWIZC++MTnjCs4FDgL+0dYyqXhBRfE8S2UeBR5PsqwBaOch7KqMsfNYwjD5I3J
bFLYH1BmCVqq5NQ316PSNtelXDMPogsQblqUso+BYu4+wgoC6dlerNjLLpVQ3D6IPEMhYich7wGvBp3Dzcr4qIpSiPwKZXNQyjLxK3G+oHwNGqugmao5eeAf6Wror1VhIO7o07amlqUrKyemwqK8MwjNjEzQ2VlTAUni0pHLtPUZCXTUlhLnWNTWzZbb11hmH0DeK2LP4lIk8BD/vtzwBPpKdKvZ/S4gK2V9dTuaOGYQPzM10dwzCMTtNm60BEDhSRD6rqd4FpwBF+eRn4XTfUr1cyyqZXNQyjj9Fey+JX+DTiqvp34O8AInK43/extNaul5KYBGmDObkNw+gjtOd3GKGqC8OFvmxcWmrUB2gZxW0tC8Mw+gbtGYuSNvYVdGVF+hKJ8Fkba2EYRl+hPWNRISJfDReKyFeANqdV3ZdpblnYKG7DMPoI7fksrgAeE5ELaTEO5UAe8PF0Vqw30zJjnrUsDMPoG7RpLFR1I/ABETmRlmR+/6eq/057zXoxI4qcsdi4cw+NTUq2DcwzDKOXE3c+i+eA59Jclz5Dv9xshvTPY8vuOjbvqm02HoZhGL0VG4WdJhLhs+vNb2EYRh/AjEWaGFnknNwbzG9hGEYfwIxFmkg4uS181jCMvkBajYWI3Csim0RkUaBssIjMFJGl/nOQLxcRuU1ElonIAhE5KnDMxV5+qYhcnM46dxUWPmsYRl8i3S2LPwCnh8quAp5V1YnAs34b4Axgol8uAe4EZ1xw83cfA7wPuDZhYHoyFj5rGEZfIq3GQlVnA1tDxecA9/v1+4FzA+UPqOMVoERESoGPADNVdauqbgNm0toA9ThGFtkkSIZh9B0y4bMYoaqVfn0DMMKvjwbWBOTW+rJk5T2axPSq1rIwDKMvIKqa3hOIjANmqOpkv71dVUsC+7ep6iARmQHcpKov+vJngSuBE4B+qvoTX/4joEZVb4k41yW4LixKS0vLpk+f3qE6V1dXU1hY2Cm5+iblgkc3IsAjnxxB7Z6aTutMRz1Np+k0nfuuzjDl5eVzVLU8cqeqpnXBZaddFNh+Gyj166XA2359GnBBWA64AJgWKN9LLtlSVlamHaWioqJL5Mp/MlPHXjlD122r7jKdHZE1nabTdJrOOAAVmuSZmoluqMeBRETTxcA/A+Wf91FRxwI71HVXPQWcJiKDvGP7NF/W40lkn7WuKMMwejvpDp19GDer3kEislZEvgzcBJwqIkuBU/w2uGlalwPLgLuBrwGo6lbgx8DrfrnBl/V4WoyFObkNw+jdxJ2Du0Oo6gVJdp0cIavAZUn03Avc24VV6xZaxlrsobR/hitjGIbRCWwEdxppGcVtLQvDMHo3ZizSyMhiyw9lGEbfwIxFGhll06sahtFHMGORRkpLLD+UYRh9AzMWaWT4wHyyBN7bVUt9U3oHPxqGYaQTMxZpJDc7i2ED81GFbTVNma6OYRhGhzFjkWYS4bNbahozXBPDMIyOY8YizSTCZzdXm7EwDKP3YsYizTS3LMxYGIbRizFjkWYSKT+2mM/CMIxejBmLNNOc8mNXQyJrrmEYRq/DjEWaGTvE5ZWfu6GO86a9zNzV2zJcI8MwjNQxY5FmDhtVxI0fn0xRnvD6ym18/I7/cPmf3mD1lupMV80wDCM2ac06a4CIcOExYxmjm3hl+0DueXEFMxZU8tSbG/j8+8fx9ZMOpKQwL9PVNAzDaBNrWXQT/XOz+N7pB/Pcd07gE0eNpqFJuefFFRx/83PcPXs5tQ0WLWUYRs/FjEU3M6qkgF+cN4Xplx/HBw8cws49Ddz4xBJOvnUWj89fb05wwzB6JNYNlSEmjy7mwS8fw/PvvMdPn1jCOxt38Y2H57LfwGwmLHiNAfk59M/Ppn9+DgPzc+jvlwHN69ms3lrHwI1VFORmU5CX7T5zs8nKkkx/PcMw+hhmLDKIiHDiQcP50IFD+ductfxi5jusraplbdV78ZU8O7tVUX5OFoUJ45GXTaHUc1n+Bk47dI
QZEsMwOkS3GwsROQj4c6BoPHANUAJ8FUg8Kb+vqk/4Y64Gvgw0At9Q1ae6r8bpJyc7i/Pftz/nTBnNo8+9xqix49lV28ju2gZ21zawK/CZKN9V28CW7TuR3H7U1DVSU9/Y/Fnb0ERtQxPbqG8+x6UPzuGgEQP5+skHcsbkUrLNaBiGkQLdbixU9W1gCoCIZAPrgMeALwK/VNVbgvIicihwPnAYMAp4RkQmqWqf8wgX5GVzyNA8yg4eEUt+zpw5lJWV7VXW1KTUNjRRXddATX0je+obefi5uTyxvJ63N1Zx+Z/mcuDwpVx+4oGcdUQpOdnmtjIMo30y/aQ4GXhXVVe1IXMO8Iiq1qrqCmAZ8L5uqV0vJCtLKMjLZsiAfPYbVMiBwwdy5oH9ef67J3DjxyczuqSAZZt2ccWf53HqL2fztzlrqW+0VCSGYbRNpo3F+cDDge3LRWSBiNwrIoN82WhgTUBmrS8zUiA/J5sLjxnL8989gZs/eQT7Dy5kxebdfOev8znp1ud55LXV1DWY0TAMIxrJVKimiOQB64HDVHWjiIwANgMK/BgoVdUvichvgFdU9UF/3D3Ak6r6twidlwCXAJSWlpZNnz69Q3Wrrq6msLCwy+R6os7GJuWFNXt4dPEu1u9yPXpDC7M484A8Tj+omPzstn0avfm7m07TaTqjKS8vn6Oq5ZE7VTUjC6576ekk+8YBi/z61cDVgX1PAe9vT39ZWZl2lIqKii6V68k6Gxqb9B9z1+optz6vY6+coWOvnKFH3fC0/ubfS3V7dV2PqafpNJ2ms+t1hgEqNMkzNZPdUBcQ6IISkdLAvo8Di/z648D5IpIvIgcAE4HXuq2WfZzsLOGcKaN56orjufPCo5gwKIctu+v4+VNv88Gb/s1Pn1zCpqo9ma6mYRgZJiPjLESkP3Aq8F+B4ptFZAquG2plYp+qvikifwEWAw3AZdoHI6EyTVaWcMbhpQyrXcee4nHcOWsZLy3bwrRZy7nvpZV8qmw//uv48Ywd0j/TVTUMIwNkxFio6m5gSKjsc23I3wjcmO56GW6g4HETh3LcxKHMW7Odu55/l6cWb+BPr67mkddW89EjRnHph8dnupqGYXQzNoLbSMqUMSXc9bkylm2qYtqs5Tw2dx3T569n+vz1HDkij0/WreT4icMYN9RaG4bR1zFjYbTLgcMH8vNPH8m3Tp3E719YwcOvrWb+xjrm//NNAPYfXMiHJg7l+EnD+MCEIQzsl5vhGhuG0dWYsTBiM6qkgGs+dihfP+lA7n7yVVbV9efFpZtZvbWah15dzUOvriY7Szhq/xKOnziM4ycNY/Lo4kxX2zCMLsCMhZEyg/rncfIBhZSVHUVjk7Jg7XZeWLqZ2e+8x9w123l95TZeX7mNW2e+w6DCXE4em8fhRzaRl5PpMaCGYXQUMxZGp8jOEqbuP4ip+w/iGydPZEdNPS+/u4XZS99j9jvvsXZbDX9bUs+bv32JX5x3JIeUFmW6yoZhdAAzFkaXUlyQy+mTR3L65JGoKq+v3MY3HnyNJZU7Ofs3L/KtUyfxX8dPsKy3htHLsH4BI22ICO87YDC3nDaEi47dn/pG5eZ/vc15015m5ebdma6eYRgpYMbCSDsFOVn85NzDuf9L72NEUT5zVm3jjF+/wB9fXmnTyBpGL8GMhdFtfHjSMJ6+4sOcO2UUNfWN/Oifb/L5e1+jckdNpqtmGEY7mLEwupXiwlx+df5U7rjwKAYV5vLC0s2c9svZ/GPuOmtlGEYPxhzcRkY48/BSyscN4upHF/LsW5u44s/zOGJ4HkeuXURBcP7wvGz65WbvNad4QW42K7fXM+i9XeTnZpOXnUV+bhb5OVnkZWchYs5zw+hqzFgYGWP4wH78/uJy/lqxlhtmLGbBpjoWbGpr0sQQM2dFFufneMORk01hdiPnbn2Hs48cxYHDB3RRzQ1j38OMhZFRRITzjh7D8ZOG8cdnXmfEqDFU1zVSU9dITb37rK5zc4
kn5hWvqWtkW9VusnPzqa1vpLahidqGJuoamqhrbGredkmK4bZnl3Lbs0s5tLSIs6eM4mNHjmJ0SUFmv7hh9DLMWBg9gpHF/ThpXCFlZeNiyc+ZM4eysrJW5U1N6gxGfRO1DY1Mf+EN3qruz7/e3MDiyp0srtzJTU++RfnYQZw9ZRRnHl7K0AH5XfxtDKPvYcbC6FNkZQn9spyfA3KZMjKfL5cdyY/Pncysd97j8fnreXbJRipWbaNi1Taun76YD0wYwtlHjmJInc1BbhjJMGNh7BP0y83mI4eN5COHjWRXbQPPLN7I4/PXM/ud93hh6WZeWLoZgPEvP8/RYwdTPm4QR48bzNghheYwNwzMWBj7IAPyczh36mjOnTqabbvreHLRBp5YWMnrK7aw/L3dLH9vN3+uWAPAsIH5HD1uEOVjB3P0uMEcUjoww7U3jMxgxsLYpxnUP4/PHrM/nz1mf159vYL8kQdSsXIrr6/cSsXKbbxXVcsTCzfwxMINABTmZTOhJJtjKhdzSGkRh44qYsKwAZZR1+jzZMxYiMhKoApoBBpUtVxEBgN/Bsbh5uE+T1W3iesH+DVwJlANfEFV38hEvY2+S06WMGVMCVPGlPCVD41HVVmxeTcVK7fx2sqtVKzcysot1Szc1MjCTSuaj8vNFg4cPpBDSgdyaGkRh/hlcP+8DH4bw+haMt2yOFFVNwe2rwKeVdWbROQqv30lcAYw0S/HAHf6T8NIGyLC+GEDGD9sAOcdPQaATVV7+Pvzc6gvHM7iyp0sqdzJqq3VLPHrf2dd8/Eji/oxqQQ+k1vJ8ZOG2gyCRq8m08YizDnACX79fuB5nLE4B3hAXT6IV0SkRERKVbUyI7U09lmGD+zH0aP6UVY2sblsd20Db22oajYYSyp38taGKjbs3MOGnTD7T2+Qmy28f8JQTj1kOKccOoLSYhvnYfQuMmksFHhaRBSYpqq/A0YEDMAGYIRfHw2sCRy71peZsTAyTv/8HMrGDqJs7KDmsqYmZfnmXfzxmbks2ZlLxaqtzH7HTQj1o3++yeTRRZx6yEhOOXQ4h9qEUEYvQDKVvE1ERqvqOhEZDswEvg48rqolAZltqjpIRGYAN6nqi778WeBKVa0I6bwEuASgtLS0bPr06XxvPpAAACAASURBVB2qW3V1NYWFhV0mZzpN587aJuZU1vL6+j3M21BHbWPL/25oYRZHDM1m/JB+7FeUw5iiHIrzk+e46m3f3XT2XJ1hysvL56hqeeROVc34AlwHfAd4Gyj1ZaXA2359GnBBQL5ZLtlSVlamHaWioqJL5Uyn6QxSU9egzy7ZoFc9ukDLfzJTx145o9Vy5PVP6afufEmvenS+/v6F5Trr7U26fnu1NjU19ervbjp7ls4wQIUmeaZmpBtKRPoDWapa5ddPA24AHgcuBm7yn//0hzwOXC4ij+Ac2zvU/BVGL6VfbjYnHTyCkw4ewY1Nk5m/djv/eHEhe/JKWLqpiqWbdrG9up7XV27j9ZXb9jp2QH4OA3OV4S+/yMB+uRQV5FDUL5eB/dxnUUHLeuV7dRRtrKK4MJeSgjwL7zU6RaZ8FiOAx3wzOwf4k6r+S0ReB/4iIl8GVgHnefkncGGzy3Chs1/s/iobRteTlSVM3X8QTQf1p6zsCMC19jdV1bJ0465m47Fs0y6WbqxiW3U9u2qhcteOeCd4fnbzav+8bEoK8ygpzHVLgVsfVJiH7KphwOgqJgzrT062GRWjNRkxFqq6HDgyonwLcHJEuQKXdUPVDCPjiAgjivoxoqgfx00cute+rbvreOG1uYyZMImqPQ3srKl3n3vqW62v37ydhqw8tlfXs72mnt11jeyuq2Hd9uiZCW9/fTb5OVkcXFrE5FFFTB5dzORRxUwaOYD8nOzu+OpGD6anhc4ahtEGg/vnsV9RDkftP6hd2WBmXlVlV22DMxzV9WyvqWNbdT07quvYvKuOV99azfqabFZvrWb+mu3MX7O9WU9OljBxxE
AmjyqisH43K1hLSYFvnfiWSnFBLrnWIunTmLEwjH0AEWFgv1wG9stlzODW++cMrqKsrIwd1fW8WbmDN9ftZNH6HSxat4Plm3c3jx8BYMH8yHMMzM9x/hHftZXfsJuyqnc5YGghY4f0Z9yQ/hTkWQult2LGwjCMZooLc/nAhKF8YEJL95cbdLiTRet28uriFeQPHMT26jq219Szo7qebdV17Kipp6q2garaBtZua+nmembFW3vpH1nUj7FDCjlgaH/GDunPAUMLqdpWz7At1RQV5DAgP8d8Jj0UMxaGYbSJG3Q4mLKxg5mcv4WysimtZJqalKrahmbjsbW6jv/Me5vGwiGs2rKbFVt2s2ZrtR/VvodXV2zdW8Ezz7WcLy+booJcH92V46K++uVQVJCL7trFe/kbGDe0kHFD+vt5S4zuwIyFYRidJitLKC5wvov9h7gBYUW71lBWdmizTENjE5U79rBi825WbtnNys3VrNyym3crt9IguVTtca0T54hvpHLHnshz/XHhnOb10uKWlsq4IYnWSn921zWxp76RnCwhO0tsTpIuwIyFYRjdQk52FmMGFzJmcCHHM6y5POiIb2pSdtUForxq6tnpP7fX1PPG2yupzurPqi3VrN5aTeWOPVTu2MMry7e2PuE//9W8mpst5GZnkZPlPnOzs8jxZY31tZS8/CL5OVnk52S7z9ws8rL9dm4W+TlZ7Nq6i3U56xk3pJBxQ/tTtI8lhjRjYRhGjyErS1z3U5IH8ZSCrc2GpaGxiXXba1i5pZqVm3ezYvNuVm3Zzcot1VRur6YJoaGxiSaF+kalvrEx6XlX74g3buXhN+c2rw/un9dsOMYN6e8/3XZfxIyFYRi9kpzsLMb6rqcPTxq2175wa6W+qYmGRqW+sYn6RqXBb9c1NjF/4SImTDyY2oYmahsaqa1voq6xZT1RvvjdNezJGcjKLbtZtaWarbvr2Lq7jjdWb29Vt6EFWUx9s4LJo4qZPLqIw0YVM6Iov1d3h5mxMAyjT5OVJeRnZZOf5Gm3vTiXI8eURO8MMKf/9r3GrWzcWet9L86Bv8r7YFZu2c3mmiZmLt7IzMUbm48fOiCPwwLGY/Ko4kSuu16BGQvDMIwUERFGFvdjZHE/jh0/ZK99jU3KjFmvwaD9WLRuB4vW7eTN9TvYvKuOWe+8x6x33muWLc7P4qw1C/noEaUcc8AQsrN6bsvDjIVhGEYXkp0l7FeUQ9mU0ZwzZTTgWiJrt9U447F+B2+u38midc6APPTqah56dTVDB+Rz5uEjOeuIUZSPHURWDzMcZiwMwzDSjIg0R4KdcXgp4AzIo/9+leX1JcxYUMnqrdU88PIqHnh5FSOK8jnz8FLOOqKUqWN6huEwY2EYhpEBRIQDSnL5VNnBfPcjB7Fo3U5mLFzP/y2oZO22Gu57aSX3vbSSUcX9OOPwUpp27WbhnhVkZwlZWUKWCNkiiLjWTJa48lVr9jDpsPoun/PdjIVhGEaGEREO36+Yw/cr5qrTD2b+2h383wJnONbv2MM9L65wgvMWx9J3xgf2mLEwDMPoy4gIU8aUMGVMCVefcQhz12znubc2sXzNeoYOG0aTKo1NrhursUlpVEWV5vWtW7d2uaEAMxaGYRg9lqwsoWzsIMrGDmLOnF2UlU1u95g5c+Ywoqhf19elyzUahmEYfQ4zFoZhGEa7dLuxEJExIvKciCwWkTdF5Ju+/DoRWSci8/xyZuCYq0VkmYi8LSIf6e46G4Zh7OtkwmfRAPyPqr4hIgOBOSIy0+/7pareEhQWkUOB84HDgFHAMyIySVWTZwUzDMMwupRub1moaqWqvuHXq4AlwOg2DjkHeERVa1V1BbAMeF/6a2oYhmEkyKjPQkTGAVOBV33R5SKyQETuFZHEjPSjgTWBw9bStnExDMMwuhjJVNZDERkAzAJuVNW/i8gIYDOgwI+BUlX9koj8BnhFVR/0x90DPKmqf4vQeQlwCUBpaWnZ9OnTO1S36u
pqCgsLu0zOdJpO02k6e6LOMOXl5XNUtTxyp6p2+wLkAk8B306yfxywyK9fDVwd2PcU8P72zlFWVqYdpaKiokvlTKfpNJ2msyfqDANUaJJnare3LMTN/nE/sFVVrwiUl6pqpV//FnCMqp4vIocBf8L5KUYBzwITtR0Ht4i8B6zqYDWH4lo5XSVnOk2n6TSdPVFnmLGqOixyTzIrkq4FOA7X1bQAmOeXM4E/Agt9+eO4bqjEMT8A3gXeBs7ohjomta4dkTOdptN0ms6eqDOVpdtDZ1X1RSAq3+4TbRxzI3Bj2iplGIZhtImN4DYMwzDaxYxFNL/rYjnTaTpNp+nsiTpjk7HQWcMwDKP3YC0LwzAMo13MWBiGYRjtYsYiAhHpJyL9e6q+3oqIdP2MLIbRAUQkT0Qm+6Xrp5XrRnzmivSfx3wWeyMiXwRuBuqBX2jrLLi5wH8Dx/uiWcBdqlrfEX1eZhJwJzBCVSeLyBHA2ar6kzbq+VFgGpANfEdVH0rtm0bqnAjkqmq8iX7b1nWNqt4Q2D4V9/0PD8ndFnW8qn4jQmdsWS//PuCXuIwB31fVZ0L7j486TlVnR5X7Yw4Ervc6/1dV5yWRGwpcAeQBt6vqmig5L9tlv6WIFAPXAR/yRbOAG1R1R4RsFW7MU3MRoKpaFJK7N+pcqvqlkNwnksj9PeLchyaRTXrvtXftReQzwKdx/6XfAoNxWSIeDMmdgBsYvBL3nccAF0f97iIyOEk9tyap42TgUKBfQPaBkMww4MoIuZNCcrHudxF5Q1WPipLtSsxYhBCROcBpwC7gP6paFtr/e9zNer8v+hzQqKpfSVWfiLyoqseJyCzgu8A0VZ3q9y1S1aRzKIrIq8CFwDZgZuJmEZFcVa0PPQgS41paPQgC+r4PXAzsBmap6rciZBI6C4AakjxcvOwdfv+PgFuBUuBSVV3u939TVX8tIsuBncAdQG3ieFW9P6ArtmyoDi/gHpxbgbs1lPNGRJqACmBD6BqdHaXPHzMTeMTrvEpVj0ki9w/cINMtwCdUNdIwednI3zKwvwGo9nXsB+wh+XV/FFjE3vfnkara6kEuInMT91tb+Ov+XX/+nwHfw1Xg0ZDcfYHNjwHTfT33Mipe9oWIU01W1UER5Ylj2rz2IvIO8H3g90A57j/3rKoeFpKbA3xWVd/225OAh0P/zWGq+p6/RzbScr/7r67jI+p3LXACzgg8AZwBvKiqnwrJPQ38GfgOcCnuf/eeql7p96d0v3u574TrE2WkO0U6Rvr15gV4I7A+O2L//DhlcfQBRf7zdf85N7BvXkfqCTzuP38C/Af4aNzvjTMCkqhPG7JzY+r8Ee5PdmnEvlf9Zw5wGfA68BUgqzOyKf6WZ+BSyfwVOBfITvH+eKGzcjHrOTdqPYmuVvdNsnsJNz3AkcDBQHHM+q0CDo1xnWLdI6FjUrlOrWQT5wz+H4PHBMoWtFcGPOM/v+Lvt8uAnHbqtxDXtT/fb4/AGf+w3JzwOYP/uVTvd9wLyb3AfYHl3lSvf3tLJiY/6pGIyHTcW/N4EXkc99CMaio3isgEVX3XHzceaJWnKqa+PwFnAZtFZIKXR0Q+BVQmqeftXm4/30wVIPiWMxJAVX/om7s/EpFvA9eo6kttXQNVrfHnqGlLjr27LiLxXRJvAs8AF4nIJn+OxNtOld9uAH7r30q/CfxHRG7RvbMKpyKL/74Aw/26EJHWXlWfBJ70XQ0/xb2VRs6VEuhiKRGRj+MeCq26KEQk0SooEJGp/tyR/qoYv2WCfC+fh7ufrlPV66J0AjUicpy6TAmIyAdxBjuKDcDtuJeEUhHZBnxRVStCcvX+Og7ETRHwWxH5gyZv0R1IxyZWi7yv4l57YLS/jqWB6xk1nUGF7yFIdE9diGthBhkIoKq/F5E/4h7aL4nIbZq8m7BGVZtEpEFEioBNuC6uMIku60rfBbk+9H1Sut+B1RrReutqrBvKIyIfjipX1VkhuZNxlns57mYci/
uDPdcRfV52PG4gzQdwXRErgItUdWWE7MVJ9N7v99+kqlcFHloABwDXAGtU9ayQvoW4P+mBuImlBBinqgMjzp3Q+RDwWS+L+smsQrL3hcsIdEmIyHBV3RQ4P15fMTBaVbMDumLLevlrI86Nql4fkivy3+NsYCnubWx+1LFJvg+q+sWQ3HNJ5E6M0NnmbxmQuwt4P67786/ADuBUVT0jQucUXBdUMe4abQW+kOx7hY49DviVtu6uGw98DfdS9Cuv8xeqellILtFNuRX4lqo+1sa5ovwl/VS1lbM5hWsf93rm4x7+x/miF4A7VLU2IHOFqv4q5IcpBr7lVOqREfW8A/fCcT7wP7husHkR9TzLn3MMzlgXAder6uN+f6r3+4OqelHUd+9KzFgEEJGJqrpU3KRMY3FdAq0ukL/ZDvKbbwdvsoBMW29/yc7fH9fUrGpD5hJVbXeEZtyHloiMTSLXKmNvEp2qIcdcKqR4/tiyMc9dDczH9a3vCej7RUf0eZ3N2ZNjyMb6Lb3sZJxvbInf/oCq/qcN+SIAVd0ZR3/guPKIlkW3ICIvqOqH2pdsU0cBsL96f0QM+TwgP9l/Lq6hijhuHK6beUGceiTRkcp/YwRwtN98TVU3dfS8SetjxsIhIncDJ+Fm7RsFFOIu+uUhuViRB3HlvOz/Ajer6na/PQg3T/kPO6M35rl/p6pdHnonKUR4iciRtETvvNDG2/3+UeWqujok9xwRXRphoyYi1yWRuz5cJmmITOnq39LrTDVi7KO4+e2DUTk3JJFtL7osdjRUEv2zNSIQQETuB74Z+n/cGu56EZGPAbcAeap6gG9l3aChgAVxUyBcCNwG3ID77req6s/j1LON+h+uqgsD2/m4FsNVIbn7iL7vwt8n7v3+adz3fh7XAvkQ8N2I7qpOYT6LFj4AHILrZxwJNOHSpYeJypgbRaK/fC+SvLWeoarfD8hsE5EzgVbGAtdv2+pPGf5DiguhvJa9Q3yjQiijZ8WKIOr7+HNHfae78RFeXmaBiPwJ53gP6vwm8FUgUf8HvQG7PULn/+H+ZBL6PCIk9x2/70HcQyGSFFt+cwLr1+OubWeJ+1sei+uuOAQXipsN7NboyLZzcF2O7eK7twqBE3ERRJ8CXmvjkFu97q243zd87/wZWIy7Vs2RQ7T8tsFzB7tY8PLjkpz3iIShgOb/R1QU13U4n9PzXm6e70ILcxmuq+jf/px7cD6LVsZCRPbDXfsP+qIXcIZrbYTe+3331WwRORFnjKL8GzP85834yLIkxL3ffwgcnWhNiPNVPgOYsUgTNapaJyIPq+oeABHZEyF3kIgEjUgifDT8A2YDA4hnXLJFJD/RneWb0vlJZItxTvGg3qg/5L24EMrz/PbncL6W8MMp4VzdiyRvoj/CRcMk7YsOUKiqr4ns9fUbIuS+jJvoajeAiPwMeBn3Bw3X6XAvI8ApuDfcpyPk5ni5msR6FP5P9T1av1m36lbTvUN5rwj3gwc4QkSCXT9Jw4uJ/1v+Bvdw+yvuAf15YFKS829po25hPqCqR4jIAlW9XkRuBZ5sQ76/qj4LzV14YSbjpkQeAPyona6gs9rYFyZLRAap6jZ/7sFEP7vqVXVH6J5ripDbqaoVIvKu+vESSf7r4P4zf8KN3wC4yJedGiF7JvCYiGzA+SE+oapLw0LqQ45F5IcaCj8OycW633Fd18Fupy2kYcC1GYsWHgVQ1f+G5jfzqAFXK3Ax5O2xIVlzPoKHgGcD/aNfBB5IIhs38mGCqn4ysH29iER9nxr2fmtuUydumtuTca2UZ9qQjRvhJewdTdZI+wb2l7iQzx24P+9nk8i118f6EO5t+CwC8e5RguKc+wJMJRDzHsFCjTF2wRM7ikVVl4lItroZIu8Tkbm43yLMwf533oOLsnkJ+G3iBShEIkqqWkRG4R4ypWEhiR9d9jZwnoiUAb8QkfXAdaq6LkJ2ldc9nIChTsKtwMsi8le//Wmi57d5U0Q+i3v5mg
h8Axc+HiYRoXiAtEQqHpDk3MNUNei3+IOIXBElqKobROQ03MvUP6MMRfiQdvYnaO9+f1JEngIe9tufoY35gTqKGQuPqv40tL0D+EKEaF1Mh+rMFM79MxGZj3t7KAYeUdXfJBF/M6bauCGUW+O+ifq3sO/6B8u1IvId3Bvk6xHil+EivA4WkXX4CK8IufuAV0Uk0Vo5F9cqaosTgKPUhSm+Et4pLZE2hf4tP9nb/RBVvUfcIKhZwCwRifou4B5YTbjQ0ciomw4Q97es9o7YeSJyM87oJntzPATXqi3A+d4+jetiirr2M0SkBNf98gbumt0dIZeIjLs7sN7qZUZaQoHBRQt+GBdlVhghezbumo7Cdf2OxY37OCwsq6oPiEgFzqcI7o09aqT313GzatbiHpxP4Vo6Yc7xn7cGylplVvBsEZGLaHkQX4Azqq0I3HfZwEkiciPRI+KbIxB9L0Wy3okEJ9DG/Y77X0+jJbrrd9pGJFpHMQe3R+KnNDge1y/6EV80E/ilJk/3sdebU9g55WV+jnsA/Rr31tCIG3naahR1XCRmCKXvUvlVTJ2JsSN4nROAgzQUyhc6Jk6E11EEwhhVdW479Wh2DCdzisZBRF5R1WP9W9ltuDfxv6nqhI7o8zrHqx+lHlN+LG5O+Wd892NO+Fp5mU24bohv4X7TO1R1Wcxz/Ez96OA2ZPJxoaut0oLERWKGrnrZ+biH/zOqOtX38V+kql9Oovs43HW6z3cfDlDVFTHqNBLn53lP/TgiXx4reshf+9uBY33Rf4CvaxvpW2LUKaWovvbud7F0H92LxE9pcBtuMNNFuBxRl+Hezi8NyX0M+AWhNycNpR7wsstwDva3cd0A9bjRnVGyqabxaDeEUlx0yUT2NmpReXJSGTtyKO5h8FdcxMkQ4MawIRCR0eFuChG5VFXvitDZ3GIgkP5CQ7H5vo/3QuAAVf2xiIzBzen+WkguKt79OlWdHnHuWI5OcckSv0xrP0hUyouvApcAg1V1gu86uUtVTw7LpkLowToUGBj1YPV1/RrOUCvwInBnuMtKXJqNT+ve0UiPqOpHSIK0H5Jaoarl3mhM9W/N8zV6/MK1OF/NQao6ybds/6qqHwzJRQVgfAvXKnhQfRiriJyHa009TzvRQ4GXyL38Skl+z0eBe4B/qWqUryQhFyvfVAr3+1rcsyasr8Mh4JFoFw8J760LMVMaEErNgWtyvhkhNx/3gEzInQjc09a52TutQ6s0Bb48VhoP3Gjxy3EpB6bhIiOmRsh9BZemYBvwHK5J++8UrttZOIfr2FD5PNzDdSXu4XkePs1BhNzBfv0gWhIzdua3TCSSW+K3BxGRwgTX8ioJbA8mSZoEXAvyi7iu2xxcF2VUKoe/4ro+3sW1Fp8Gfp1E5zzcW2/wd18YIVeFyxGUWKpwTtoondfixo2847dHAS8lkf0L7uF2ol/uxj2EW9UzoqxVOg9c4sQKfz+sxL1UfTfJuZ/BOcJ/g3uY/xqXOy3ZdZLQdYpK2VHpv39wWR8hNx8YHtgeRpKUPcA63P/tv3HBIZ8EPplE9hScH+xd4CaccYuSa/J1XY7rnl0BLO/E/R71va/tzH8o8jxdrbC3LrjxFd/2F/ol3IPz4gi5Cv85M1AW9QdPyM3H53Rp44asxoXpJj4X4kIjk9V1GK7b5Fngg0lk4j6sF+LegOf57YOBvyfR+XhomY4zMmcCI0OyC/znO4GyqAfMIcBcnBNvHnB8O7/TJ3BvUbcC5yaRiTK+UTm9ouoTmdOImDmXaHk5SHz/XOCVZPdc6Jgcoh+CsfMsEfPB6ssXxyybgxvoltgeS3TOpWW4FsBOnOEtjNLnZfvj/C45uNDPr+NaWFGyr4V+1/5JrlNUnaJ+44Wh7axwWWjfmTiD9hdcmHt7v0ExLmBiDc7QfBGXzTmxP1a+KVzYe6ulM/dHZxZzcLdwAS0pDT6FT2lAS/bOBL8Ul9n1VGju+/wprdkuIgOA2cBD4nIj7U
5y7kPiVlJaUm78ARfBcYeItErjgTNQXxeRj6jqPf7YqOiZPaq6R0QQF777logcFCGXqGcwu67gWgVRkRcDxI0hyJGWfD6tuspUdYm4gWFP4tJOt5Ue/A5cWpKEs/FSETlVQ2kncLmMsmmJxBpGdAhl3JBMiO/oTPiutosbdb0BGJ5E5yxx2X4LxKVw/xrOAIfpJ27gYi1QqW37FepUVUUk8d3bmkflDRE5VlVf8bLH0DpHEjin8YvisiMnum2iBnKmEpJ6HS1+ugtw/7vxuG6jMH8RkWm4cSlfBb6Ec9qHGSsu59NWXCBCYpxCmH9J6+ihyJBhdd1JT4jISlzX9OXJZAFEZAguTP0i3EvQQ7huvotxjmo0fr6p54nIiox7pgSJHUzTGcxnkSb8n3QP7ke+EPe28ZCqRkZSpKD3uahybZ3GI5G2+BZcnpos3CjxCSG5x3BvPlfgfAzbcG9BZ0acu5UjLZlzTeLn80lEhgwE9sNFxKARkSEi8hZwiPqbVkSycF2Ah4TkLsQ9AI7CGftPAT9U1b+G5D6Py+WzV0imqv4x4txjieHoFJGv4MKwD8cZ9MSYg2kROrNwrb7TcPfJU8DvNfSn9L95IsKpFPcbRSX8Q1yE2kTcOICfev0PacQgRxFZguv6SwRd7I/zmzUQis7xvo/Ed39FVTdH6NuOe5Ad7z8FOE4j0o6n4qfz8qfirlMJrpXYKlpQRD5Jy/imA3Ah7oeqal6E7CfYO6giMnpI3MRC5+JaTfdpG8EX/r90EPBHL7shsK9Cfc4tiZlvSkTOwBmeXJzRma4udDojmLHwSMwh+L2FJA9rbev7eAd2Mc5BVxexvxbXtE68uc3ATdYTOZFNzHqOjSrX6Pw3M4DLtCVGfyzwG1VtNe5FRA7GjQcRXGTZkiTnTzjiwflqIiffievoFJGRwYdEW0gKuaFCx0Um/AvsTzxYwRmgfrhuoVnB65rs2icIybYbBJFiAMQbqnqUBObUaOPFo0PRgt7J/iAwFDcu6HlfHntCI3HzWSzDteo0IBv1MnMGLrDhg7iWbLKAgVgvUgH5RFbkqaoamRW5OzBj4fFvJRAagq9tjLBsR1/siYJS1DsE51cJRrDcEG6xSPx8Ov1w3Qov4AacTQL+ooEww9C5g29un8Z1R5yA65veHJBNJU0CEi/EeBYu3DER1XQ0rpm+wx9ztpdLaXazOIgbK7IK99a4EZoz7oaj5bolN5QkSfgnbpAZifp5jsO1bt8IGzKJETnlW0vfxLX85uFaGC9rxEh3iR+SWo17CAezHY9X1VbdZqm0QiRGIkH/0hOMwEv8N6MmNErlZeYvOH9Nokvps7gAik+HZSOOzVGXkjxYFjsrcndgxiKExJw9LIP6ZuKa+MFc/Ceo6intnTdJ2WO4qK16/EMXlzbhM0nOv1fSP9yfeDguimZxQG4mLk1CokvnIuDChK8nIBc5OCvJgyDyzTVB4g1WRFbQkkunFBctEvkwiIvvMjod1y2QjetmaNV3naKxiDXDmYg8raqnheWS6FxKa7/S3ao6MUL2WuKFpC7EGYBXVHWKb7X9r4Zm35PUQlJTeQjHaoX4e+nntJ9IsEv/kwG9i8Ot7CRl31LVXwa2TwBuCbcUJQ1ZkTuDObhb09XWs6v1lapqcFTqT8TNPRwmrvN2PK5FsQE/cRJJRhZLRNI/3IOoVZZW4qdJ+DHuTXWvwVlR51fVWXHeXFW1OXVDVz0YNL6jMx25oYalUNWqcLePb+VG8XHcb/8GrpLrRaTVPCbED4L4ATET2kUZhTY4WNxI5+CI5yjDfy3xEgkWi8g5uK6l9bhWcVTeslSJGzAwXpzD/gbcmK5iWnJPBbkZdz/k+iWjmLHwSOpD8NvTl8glVCCBiYg0YqKgFHlaRM7HhfGBc94+FSEXN59OvboBUbf7B2KimR5FsqR/UWmxY0cPqeoWEckSkSxVfU5EIkeUR7y53i4iSVMx+z7rVs7NjiB7Ozp/rckdnenIDZ
XKC8dhvtsm6FdKlnspbuTUWnFpQf4BzBQ3o17Uwz5dCe3iRgvGTSQ4GzdeIpESZayIfDWqpZgiZbjZ7PYKw8rDcwAADo1JREFUGEg8WxLPEnVRit/Ajce4VFX/EKVMU5wPJ92YsWjhv3HO2wSCm6O5o9yK+5NvYO+8Mx2eKMjzVVzkUqIbKgvYLSL/ReANVuPn07ndy/8YQFwCxWRJyFJJ+vclr/uXuOuQiDcPk0qIcaw3V3FpScA9ZP6URFeq3IUzFGOAExIPpI6+THji5oY6MoXWyiha+5UOEpemZi+/EtEhqa1yQ6nqx/3qdeIis4qBf0WcOyoktdMJ7VJohcRKJKiqXwhui5sG9jHazrgbh9PjCEnLSPPZwPcSPrZw95KkkBW5OzBj0cLtwAWBJvbv8GGcHeRHuLj523F+hcm4MM1OoRHTnbYhuxg3v0BbMveHtneQvJ730Trp3z1J9K7COebaYx5uMGJiQppiXPM8irhvrrfgDNSRgIrIkV3gGEyWlTTMQhEpaS+wAEBVL4rZrZbtHyh7RSNFoS2BDptwI4Sf9S3lE4HNfknI3iIucmonLuTzGlVtFbMvgQmywl1coXN/V1ygSMLnkZaEdm0QK5GghII/cPfR+s6ePAWjlvgPC651k+w/HTsrcndgDm6PiCTeQp/H/bG+oW0MEIuhby5wBy7k7TLcDXxNXOdnO7rPpmVSo+dVdUZb8u3oGgZciUsP0u7bi8RM+ictUTl7EeFsTMUh/HPcxC/BN9cFGkqS530rX8H1/Quubz7ZhEpdStzAAl8ea4YziY5G+o92ModUXFL5jXoCvqWKqu5Ksj/2b5QuROSLuLQ0X1TVh5PIzFHVMnHzjRzhy15X1aOj5NONtSw8msJI4pg0qerdIvIdVf0zgLhpPDuFiNyEexNNhOd9U0Q+qKpRo7PjkHh7+Sgx3l68zyWO3yU82jsZsWcU9G+uwcFUyd5cvwwcG+FbSbuxILVR4XFnOPsmLdFIJ4qPRupsRSV+eHesCbJEpAHXSmwuSqKvSxGRa1T1BhE5HJc6fbAv34xL2bModEgqv1E66vtTXNTfkcDP/MvfFaq6MSSayAZQ6Z9N6/HfLROYsfDI3iOJHxSRH0Cn+qQTffvnef1ZJO/fT4UzgSkBZ/T9uLQCHTUWqczpkAqtonKSEHtGQRG5zjv92pvTuSMTKnUVcQMLIH63WiopWWKT6NKM8VYdd4KsVJz7XclZuK7LacC3VfU5aA5JvYuWl4sEqfxG6aBeVRMTGJ3rjcVTwJSQ3E+8D/F/aMmKHDnxUndgxqKFVKZ5jMOZAIG+8kKi8+l0hBJctAu4Pv7OkK63lyPFpX8Izth2vbZOE5HKjIJn43IKtUds30pXk0JgAbSe4ex8oh3CcaOROlztdvbHnSArU33ae8SF/A5IGAoAVX1eIkKBU/yNuhxVvSa0/biIRE2X+mngRd8yOtG3gG4hOn9Y2jGfRS9DRC7ApT9+Dve2fDxwVaKrqwP6ouZ0uF5VI30OKerOoiU88TzcnM8fDcncrKptTVoflI2dtz+ubyWTiMhhOMdycz2B7cEHXsQxbaZkSfH8CT/EQ7iRwolR6W+E5GJNkCUud9cFhFpxYX1djYh8GTgGdw+/REuk4EW4rMydiWrscqQlu0EiC0OyuVEy7lvZ69xmLHofIlLK3hE0sXIRpaC/n0bP2dxZvV/vjJNZRCpxTsHwwyhqUGCPR0QW4Ua434wzqj8DylX1/d10/iijpOHgBnFZg//tI+XwrZwTVPUfIbnnad26aKUvHYjIl3At9+G4+2MHbtqBayJ8ARlF4mc3mI+7zkHfyixVPbw769tcHzMWvQtxuW8mqOoicYPzhgIPaBsz4bWj75pgN5CInIKbJrZTN6RI82x1473zcX/cnBevtXNoWzoz9laVDsQNgvsZbjDXQNwb/s+0jVnWMoGIzFPVKaGyPvVbdCdJrmdUWeysyN2B+Sx6H/8ARojIBlwsfRXuZk
o6xWU7jBSRO3HjQm7F5VI6p+1DYnEHbvTsSTjnYxUudXdnwv66JW9/N1KPcx4X4MKWV3SnofBjPP4XGKWqZ4jLwPt+9fOfBIhyurd6dojINRFypOCT6hQiMgnX8hyhqpNF5AjgbFX9SXecPwViZTfItG8lTFcMxTe6lzG4eRoOV9XPq5v4Z2Q7xyRFVb+Gc0CvwWUSPU1Vl3dBPY/xddvjz7ONTqbeUNXviciRInK5X1rN1/z/7d1/qN11Hcfx58so1xImJEiWbrbpTGWF7drEP0It/8qi2MCFkqRMaDVqIATZ6A9HBaZkZlpWEoXkLFC0QGENc2RujqVuNiuHzdLYhmZ6ac3t3R+fz9k599zvPed7z/meH9+71wMu957vzvmez2Xnns/5fD+fz+tdM9tIncUEaY/F6pYVOsNwD2kVzmn59vMUr7bZLukWSYvz1y0Ur45aQ9p93/41LD8irQo8DBCp5vaVQ3z+sj5PmsN7hRRyuZLidAMiYndE3J6/RtZRgEcWdXSYtBrqoNLu4L6WhOZ9C7tI6/uvUorbmJZ82ks7Va5a3Wzauo70hnQsyFBpd/Ew9k8MwrXRjBl/GfiUpKuH+PynRMR9yhUUI+ItSUXFdb5EGnk2FlE8Stpo2m5/RHxnME0tZX5EPKmp2VBVBARW7R/Rtjm1DtxZ1M8Cmp/qGqtM+pl4ahQOakRBXEFx8uls3UbK2zlV0kZytbo+z3kdxUGGtewsoqAexZCvR7+pVKOk0aGvoBlT39qmN4GvljjfqCdAD0haTPP3WUnqhMfNk6SrA7XizqJmImJRxac8gYIiSRWc97ekvQCNGhTX0iWnqoRRbrabi9YDDwKLJW0lRaGvbL+TygfavV8FMS9D/BS9lpTpdo5Ssaq9pEUW46aWr1l3FjUj6e2khNxj2VDAXRFxeMYHdbYsmoFqRMSrkqpY5fIUUwsQXZdv91yAiBFutpuj1pE686Wk/6c9M7yOygbaVbEwoh9fJoV/7iV9CJok7X5e1+lBI3BC0SXk6KOS4zB46WzNSLqbVAilsaP2auBIRJTJYSo638DXcle1zFLSe4FTaW5iexy4MCLu7PfcxyOVDAjUmAXazUTSi6S5lfY34TK7z4dGqYDWUaa2M6KPSo7D4JFF/UxEROsqoM35Db9XA83JUYUFiICHgSsj4jalML276C9G3sopFQmjZjAhNN8IIwYcJNjiYET8bEjP1bMBXEoeCncW9XNE0uKI+BuAUtnIohUspQxqLbcGU4BoNXBv3incd4y8HStX2jBTdciygXbfJb2ONkbEw4NocBfnSNrJ1Dyy7w8ijaAfA7iUPBS+DFUzki4lrY9/gfTHvZCUiT9jntAo5Ayjo8BLEbG3wvOeRqrStrHXPCxLJC0sOh5tRXwk/ZTiEcO0gk55MvzrpMnwDRGxtboWd5Z/n7fRzCNbRVpOW1jTfVSqvpQ8LO4sakbSPcD1pElJSJOSM9XMnlM0NUb+feRLUAWfhK1CStXvIGVY3UAzcPBXbfdrnf84E9gA7IuIqhOdS5P07WgrjjVqkv7Udim58Ni4cWdRM2UnJeeisp+EbTC6LVSYIZiQiLhkcK2qH0k7gFVtl5LvH/e/a89ZWG24Uxi5jp8s3SmUdgPwO0mNWJ1FzBD3MU48sqgZSZPAX1sPUTwpaVaJlst/S0ivvcLXnApKr8L08qvHO0nzSIsFLgNeI2WE3TpuE/Ht3FnUjC/F2LDNYiL8RdI8Rfv9xmqfw6hJug94nbTZEVLhqZMjYtXoWtWdOwszq8TxPJ82G5J2R8S53Y6NG89ZmFlVarHPYQzskLQiIp4AkPQRYFqo5LjxyMLMKlGXfQ6jJuk50tL3v+dDZwB7SHHqYzv/6M7CzAZmHPc5jNpMc0AN4zr/6M7CzCqRQyinGfc0VSvHnYWZVULSUeBfpFKxrbEgY52mauW4BreZVWUN8BIpyfisiDjTHcXc4c7CzCoREXeTao2cCGyVNI5V6qxHvg
xlZpWQ9JmWmwuAr5AuQ411QJ6V430WZlaVK5iaH/XUqBpi1fNlKDOrykZSYaSPA5eTRhffHGmLrDLuLMysKptIhbmWAmfnnzd1uL/ViDsLM6vKq8CjETEZEZPAI8DBEbfJKuIJbjPrS663HqSIj/k0I/SXAG8ArwBExCdH0kCrhCe4zaxfN+fvXwAeB57Otz8IXATcOYpGWbU8sjCzSrTXkZYkYKeXzs4NHlmYWVUek7QZ2JxvX0qKKbc5wCMLM6uMpIuBCVI21PaI+P2Im2QVcWdhZmZdeemsmZl15c7CzMy6cmdh1oWkr0naJelpSTtzzeRBPdcWScsHdX6zXnk1lFkHki4CPgFcEBGHJJ0CvGPEzTIbOo8szDp7D3AgIg4BRMSBiPinpA2Stkl6VtIP856CxsjgVknbJT0naULSryX9RdJN+T6LJP1Z0i/yfe6XNL/9iSVdLukPknZI2iTppHz8W5J255HOze2PMxsEdxZmnT0CnC7peUl3SPpoPn57RExExPnAO0mjj4b/RcRy0s7lB4C1wPnANZLene+zFLgjIj4AvE7a/XxMHsHcCHwsIi4AtgPr8+M/DZwXEcuAmwbwO5tN487CrIOIeAP4MKlk6H7gl5KuAS6R9EdJz5A2n53X8rAH8/dngF0R8XIembwAnJ7/bV9ENDas/ZxUYa7VCuBcUsW5ncDngIXAv4H/Aj/OxYYmK/tlzTrwnIVZFxFxBNgCbMmdw/XAMmB5ROyT9A1gXstDDuXvR1t+btxu/M21b3Bqvy1Sguvq9vZIuhC4DFgJfJHUWZkNlEcWZh1IWirprJZDHwL25J8P5HmElT2c+ow8eQ7wWVIAX6sngIslLcnteJeks/PzLYiI35DKljp3yYbCIwuzzk4CvifpZOAtUvz2GuA14FlS/Pa2Hs67B1gr6SfAbuAHrf8YEfvz5a57JZ2YD98I/Ad4QNI80uhjfQ/PbTZrjvswGzJJi4CH8uS4WS34MpSZmXXlkYWZmXXlkYWZmXXlzsLMzLpyZ2FmZl25szAzs67cWZiZWVfuLMzMrKv/A6eG1rKRmcYNAAAAAElFTkSuQmCC\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "needs_background": "light"
          }
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<matplotlib.axes._subplots.AxesSubplot at 0x7f721821d910>"
            ]
          },
          "metadata": {},
          "execution_count": 47
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Iterating over a string yields characters, so this counts symbols rather than words.\n",
        "symbol_frequencies = nltk.probability.FreqDist(text)"
      ],
      "metadata": {
        "id": "ZZcDueCjK-yZ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Drop the space character so that it does not dominate the ranking.\n",
        "symbol_frequencies.pop(' ', None)\n",
        "symbol_frequencies.most_common(5)"
      ],
      "metadata": {
        "id": "sxW4PiRuK-oo",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "58701b83-96fa-41d1-e70d-97476277cfab"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('о', 115661), ('е', 91826), ('а', 81719), ('н', 68866), ('и', 65571)]"
            ]
          },
          "metadata": {},
          "execution_count": 38
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "symbol_frequencies.plot(30, cumulative=False)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 297
        },
        "id": "6iWl8bl9f_gM",
        "outputId": "af0ab3b4-0f5d-4729-ab4c-6037e503aab1"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZcAAAEHCAYAAABiAAtOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deXwV9b3/8dcnO0kgIQloZEtYFAE3EgHXulXRWm17u9i6r7dX/dXW2161m7ba3ra31S5WW6vWtS51qYKoVRQ3ZElABUEgLLLInrCELJDk8/tjJngMJyGEk5ws7+fjMY9z5jvf+c5nkpPzycx85zvm7oiIiMRSQrwDEBGR7kfJRUREYk7JRUREYk7JRUREYk7JRUREYk7JRUREYq7dkouZ3W9mG8xsfkTZ/5nZR2b2gZk9a2bZEctuMrMyM1tkZmdElE8My8rM7MaI8kIzmxmWP2FmKWF5ajhfFi4vaK99FBGR6Ky97nMxsxOBSuAhdx8Tlp0OvObudWb2awB3v8HMRgGPAeOAg4BXgYPDphYDnwdWA7OBb7r7AjN7EnjG3R83s78A77v73WZ2NXC4u3/bzM4Dvuzu39hbvHl5eV5QUNCmfa2urqZXr14xras21abaVJudrc1oSktLN7l7vz0WuHu7TUABML+ZZV8GHg3f3wTcFLHsZeCYcHo5ovymcDJgE5AUlu+u17hu+D4prGd7i7WoqMjbqqSkJOZ11abaVJtqs7O1GQ1Q4lG+U9vtyAUgPCU12cMjlybLJgFPuPsjZnYnMMPdHwmX3Qe8GFad6O5XhOUXAuOBW8L6w8PyQcCL7j4mPA030d1Xh8uWAuPdfVOUGK4CrgLIz88vmjRpUpv2s6qqivT09JjWVZtqU22qzc7WZjTFxcWl7l68x4JoGSdWE80cuQA/Ap7l09NydwIXRCy/D/hqON0bUX5hWDcPKIsoH9S4HWA+MDBi2VIgb2+x6shFbapNtak29x3NHLkktSlV7QczuwQ4Gzg1DAxgTZggGg0My2imfDOQbWZJ7l7XpH5jW6vNLAnICuuLiEgH6dCuyGY2Efgf4Bx3r4pY9DxwXtjTqxAYAcwiuIA/IuwZlgKcBzwfJqXXCY5sAC4Gnoto6+Lw/VcJOhBodE4RkQ7UbkcuZvYYcBKQZ2argZsJLsanAq+YGQTXTb7t7h+Gvb8WAHXANe5eH7ZzLcFF+kTgfnf/MNzEDcDjZnYbMJfgVBrh68NmVgaUEyQkERHpQO2WXNz9m1GK74tS1lj/F8AvopRPAaZEKV9G0HW5aXkN8LV9ClZERGJKd+jHQF2DzrqJiERSctkP7s6F983kon9tYFvNrniHIyLSaSi57Aczo2pnPbX1TumKiniHIyLSaSi57KdxhTkAzFxeHudIREQ6DyWX/dSYXGYt1600IiKNlFz2U9GQviQAH6zeSvXO+niHIyLSKSi57Kc+ackUZCdR1+DMXanrLiIioOQSE6P6pQC67iIi0kjJJQYak8ssJRcREUDJJSZG5gXJZc7KCnbWNcQ5GhGR+FNyiYGs1ARG9M+ktq6BeWu2xDscEZG4U3KJEd3vIiLyKSWXGPn0fhclFxERJZcYaUwuJSsqqNdAliLSwym5xEh+Vi8G56RTWVvHwrXb4h2OiEhcKbnEkK67iIgElFxiSOOMiYgElFxiaHzERX13XXcRkZ5LySWGBuekc0CfVCqqdlG2oTLe4YiIxI2SSwyZGeMKcwFddxGRnk3JJcZ0v4uIiJJLzOm6i4iIkkvMDe+XSd/0ZNZtq2FVeXW8wxERiQsllxhLSDCOLmi830VdkkWkZ1JyaQe67iIiPZ2SSzsYH/YYm7VCyUVEeiYll3ZwaH5vMlOT+HhzFeu21sQ7HBGRDqfk0g6SEhMoGtIX0NGLiPRM7ZZczOx+M9tgZvMjynLM7BUzWxK+9g3Lzcz+aGZlZvaBmY2NWO
fisP4SM7s4orzIzOaF6/zRzKylbXQ0jTMmIj1Zex65PABMbFJ2IzDV3UcAU8N5gDOBEeF0FXA3BIkCuBkYD4wDbo5IFncDV0asN3Ev2+hQjfe7zF5eEY/Ni4jEVbslF3d/E2h6Tuhc4MHw/YPAlyLKH/LADCDbzPKBM4BX3L3c3SuAV4CJ4bI+7j7DgzsVH2rSVrRtdKjDBmaRmpTAovXbqdixMx4hiIjEjbXnXeRmVgBMdvcx4fwWd88O3xtQ4e7ZZjYZ+JW7vx0umwrcAJwEpLn7bWH5T4BqYFpY/7Sw/ATgBnc/u7ltNBPfVQRHSuTn5xdNmjSpTftZVVVFenr6HuU/nVbOhxt3csOx2YwbkNZi3da2uT911abaVJtqc3/rNlVcXFzq7sV7LHD3dpuAAmB+xPyWJssrwtfJwPER5VOBYuD7wI8jyn8SlhUDr0aUn0CQxJrdxt6moqIib6uSkpKo5b97+SMfcsNkv3XSh3ut29o296eu2lSbalNt7m/dpoASj/Kd2tG9xdaHp7QIXzeE5WuAQRH1BoZlLZUPjFLe0jY63Djd7yIiPVRHJ5fngcYeXxcDz0WUXxT2GpsAbHX3tcDLwOlm1je8kH868HK4bJuZTQhPfV3UpK1o2+hwY4dkk5RgzF+zlcrauniFISLS4dqzK/JjwLvAIWa22swuB34FfN7MlgCnhfMAU4BlQBnwN+BqAHcvB24FZofTz8Mywjr3hussBV4My5vbRodLT0lizIAsGhxKP1avMRHpOZLaq2F3/2Yzi06NUteBa5pp537g/ijlJcCYKOWbo20jXsYX5vDeqi3MWr6Zzx3cL97hiIh0CN2h3840iKWI9ERKLu2seEgOZvD+qq3U7KqPdzgiIh1CyaWdZaUnM/LAPuysb+C9VVviHY6ISIdQcukA43VqTER6GCWXDqDrLiLS0yi5dIDGxx6XflxBXUP7DbcjItJZKLl0gH69UxnaL4PqXfUsq9gV73BERNqdkksHabzusmCTkouIdH9KLh1kwtBgnLFpK6qpq2+IczQiIu1LyaWDnDH6QAbl9GLVtjoen70q3uGIiLQrJZcOkpacyA/PPBSA219ZzNZqnR4Tke5LyaUDTRxzIKPykinfsZM/TV0S73BERNqNkksHMjMuPbIPZvDA9BUs21gZ75BERNqFkksHG9o3ma8VDaSuwfnllI/iHY6ISLtQcomD759xCBkpiby6cD1vL9kU73BERGJOySUO+vdO45pThgNw6+QF6posIt2OkkucXHZcIQP79mLR+u3qmiwi3Y6SS5ykJSfyw7PUNVlEuicllzg6c8yBjCvIoXzHTu58TV2TRaT7UHKJIzPjp18ctbtr8vJNO+IdkohITCi5xNmYAVl8rWggu+qdX7ywMN7hiIjEhJJLJ/D909U1WUS6FyWXTqB/nzSuPlldk0Wk+1By6SQuP76QAdlB1+QnStQ1WUS6NiWXTiKya/Lv/r2YHTt19CIiXZeSSydy1mGfdk1+aqEGtRSRrkvJpRMxM35ydtA1ecqSKlZXVMU7JBGRNolLcjGz75nZh2Y238weM7M0Mys0s5lmVmZmT5hZSlg3NZwvC5cXRLRzU1i+yMzOiCifGJaVmdmNHb+HbXfYwCy+ePhB1Dk88M6KeIcjItImHZ5czGwA8B2g2N3HAInAecCvgTvcfThQAVwernI5UBGW3xHWw8xGheuNBiYCd5lZopklAn8GzgRGAd8M63YZV504FIDHZ69iW42GhRGRridep8WSgF5mlgSkA2uBU4CnwuUPAl8K358bzhMuP9XMLCx/3N1r3X05UAaMC6cyd1/m7juBx8O6XcaYAVmM7pdCZW0dT8xSzzER6XrM3Tt+o2bXAb8AqoF/A9cBM8KjE8xsEPCiu48xs/nARHdfHS5bCowHbgnXeSQsvw94MdzERHe/Iiy/EBjv7tdGieMq4CqA/Pz8okmTJrVpf6qqqkhPT49p3XeWb+X2kmryeiXw57
P6kZRg+91me8SpNtWm2uw5bUZTXFxc6u7Feyxw9w6dgL7Aa0A/IBn4F3ABwdFGY51BwPzw/XxgYMSypUAecCdwQUT5fcBXw+neiPILgTv3FldRUZG3VUlJSczrzpo920/+7es+5IbJ/q+5q2PSZnvEqTbVptrsOW1GA5R4lO/UeJwWOw1Y7u4b3X0X8AxwHJAdniYDGAisCd+vIUg2hMuzgM2R5U3Waa68S0kw44rjg2sv9761vDFRioh0CfFILiuBCWaWHl47ORVYALxOcNQBcDHwXPj++XCecPlrYbZ8Hjgv7E1WCIwAZgGzgRFh77MUgov+z3fAfsXcV8YOIDcjhXlrtjJzeXm8wxERabUOTy7uPpPgwvwcYF4Ywz3ADcD1ZlYG5BKc5iJ8zQ3LrwduDNv5EHiSIDG9BFzj7vXuXgdcC7wMLASeDOt2OWnJiVwwYQgA9761LM7RiIi0XtLeq8Seu98M3NykeBlBT6+mdWuArzXTzi8IOgY0LZ8CTNn/SOPvwmOGcPcbS3l14QaWbqxkWL/MeIckIrJXukO/k8vLTOU/xg4A4L63l8c5GhGR1lFy6QIuDy/sP126ms2VtXGORkRk75RcuoDh/TM5ZWR/ausaeGTGyniHIyKyV0ouXcQVJxQC8PCMFdTsqo9zNCIiLVNy6SKOGZrL6IP6sKlyJ/+a2+Vu2xGRHkbJpYswM648Ibyp8u3lNDTopkoR6byUXLqQLxyeT35WGmUbKnlj8cZ4hyMi0iwlly4kOTGBS44tAOBvuqlSRDoxJZcu5rxxg8lISWT60s18+MnWeIcjIhKVkksXk9UrmW8cPRgIBrQUEemMlFy6oEuPKyDBYNL7n7B2a3W8wxER2YOSSxc0KCedMw/Lp67BeWD6iniHIyKyByWXLqqxW/I/Zq6keldDnKMREfmsfU4uZtbXzA5vj2Ck9Y4clM3RBX3ZXlPHP+ZXUlunu/ZFpPNoVXIxs2lm1sfMcgiew/I3M7u9fUOTvbn65OEATCmr4vO3v8lL89fpiZUi0im09sgly923AV8BHnL38QSPK5Y4OvmQ/jx8+TgG9UliZXkV336klG/+bYa6KItI3LU2uSSZWT7wdWByO8Yj++iEEf343edzufXc0fRNT2bGsnLO/tPb3Pj0B2zcruH5RSQ+Wptcfkbw2OAyd59tZkOBJe0XluyLxATjwmMKmPb9k7nsuEISzXh89ipO/u007p62VKMoi0iHa21yWevuh7v71QDuvgzQNZdOJis9mZ9+cRQvf+9ETh3Zn8raOn790kd8/o43eHHeWl2PEZEOk9TKen8CxraiTDqBYf0yue+So3lz8UZue2EBi9dX8l+PzqEwO4mi5e9TkJvOkNwMCnIzGJybTlav5HiHLCLdTIvJxcyOAY4F+pnZ9RGL+gCJ7RmY7L8TD+7HlGEn8NjsVdz+70Us37KL5aWr96iXk5HCkNx0huQESWdovwwO0JD+IrIf9nbkkgJkhvV6R5RvA77aXkFJ7CQlJnDhhCF8+agB/HPqLFJzB/Lx5h2s2LyDjzdXsWLzDsp37KR8x07mrtyye73xA1J54ug4Bi4iXVqLycXd3wDeMLMH3P3jDopJ2kFmahKHH5BKUdHgz5S7Oxu217Ji06fJ5sHpK5i5ppa3l2zi+BF5cYpYRLqy1l5zSTWze4CCyHXc/ZT2CEo6jplxQJ80DuiTxvihuQBkpiXxm5cWcevkBbzwneNJStQoQSKyb1r7rfFPYC7wY+AHEZN0Q5cdV0j/9EQWrd/O47NXxTscEemCWptc6tz9bnef5e6ljVO7RiZxk5acyEVHBJfYbn9lMVurd8U5IhHpalqbXCaZ2dVmlm9mOY1Tu0YmcTVhQCrjCnMo37GTO1/T/bIism9am1wuJjgNNh0oDaeStm7UzLLN7Ckz+8jMFprZMWHCesXMloSvfcO6ZmZ/NLMyM/vAzMZGtHNxWH+JmV0cUV5kZvPCdf5oZtbWWHsqM+OnZ4
/CDB6YvoLlm3bEOyQR6UJalVzcvTDKNHQ/tvsH4CV3HwkcASwEbgSmuvsIYGo4D3AmMCKcrgLuBgiPnG4GxgPjgJsbE1JY58qI9SbuR6w91pgBWXytaCC76p1fTlkY73BEpAtpVW8xM7soWrm7P7SvGzSzLOBE4JKwjZ3ATjM7FzgprPYgMA24ATiXYCRmB2aERz35Yd1X3L08bPcVYKKZTQP6uPuMsPwh4EvAi/saq8D3Tz+EFz5YyysL1vNO2SaOG66uySKyd9aa8abM7E8Rs2nAqcAcd9/nGynN7EjgHmABwVFLKXAdsMbds8M6BlS4e7aZTQZ+5e5vh8umEiSdk4A0d78tLP8JUE2QlH7l7qeF5ScAN7j72VFiuYrgaIj8/PyiSZMm7evuAFBVVUV6enpM63amNp9ZWMmj8ysZ3CeJ334+l8QEi1ov3nGqTbWpNtu3zWiKi4tL3b14jwXuvs8TkE1wWqst6xYDdcD4cP4PwK3Alib1KsLXycDxEeVTwza+D/w4ovwnYVkx8GpE+QnA5L3FVVRU5G1VUlIS87qdqc3qnXV+7P9O9SE3TPZHZqyISZuxqKc21aba7Ng2owFKPMp3alvvjtsBFLZx3dXAanefGc4/RTAA5vrwdBfh64Zw+RpgUMT6A8OylsoHRimXNkpLTuSHZx0KwO/+vZhtNeqaLCIta+1jjieZ2fPh9AKwCHi2LRt093XAKjM7JCw6leAU2fMEvdIIX58L3z8PXBT2GpsAbHX3tQTPlzndzPqGF/JPB14Ol20zswnh6bWLItqSNjrrsAM5uqBv2DW5LN7hiEgn19rhX34b8b4O+Njd9xxet/X+H/ComaUAy4BLCRLdk2Z2OfAxwVMvAaYAZwFlQFVYF3cvN7NbgdlhvZ97eHEfuBp4AOhFcCFfF/P3U9A1eTTn/Plt/v7Ocr41bjAFeRnxDktEOqlWJRd3f8PMDgAax8ndr7vq3P09gmsjTZ0apa4D1zTTzv3A/VHKS4Ax+xOj7OmwgVn8x9iBPFW6ml9MWcjfLor2KxQRaf1psa8Ds4CvERxRzDQzDbnfA/3PGYeQnpK4u2uyiEg0rb2g/yPgaHe/2N0vIrhp8SftF5Z0Vv37pHHNycMBuHXyAur16GQRiaK1ySXB3TdEzG/eh3Wlm7n8+EIGZPfio3Xbmbq8Ot7hiEgn1NoE8ZKZvWxml5jZJcALBBfapQdKS07kprNGAvDY/ErWblWCEZHPajG5mNlwMzvO3X8A/BU4PJzeJbjLXnqoLxyWzzFDc9lW28BF982iYsfOeIckIp3I3o5cfg9sA3D3Z9z9ene/nuAel9+3d3DSeZkZd18wlsF9kliyoZJLHphNZW1dvMMSkU5ib8nlAHef17QwLCtol4iky8hOT+EnJ/ZlUE4v3l+1hf98uITauvp4hyUincDekkt2C8t6xTIQ6ZpyeiXy8GXjyctM5Z2yzVz32HvU1TfEOywRibO9JZcSM7uyaaGZXUEwmrEIBXkZPHz5OHqnJfHSh+v40bPzGwcNFZEeam936H8XeNbMzufTZFIMpABfbs/ApGs5NL8Pf7/kaC64byZPlKwiOz2Zm8LBLkWk52nxyMXd17v7scDPgBXh9DN3PyYcgFJkt+KCHO6+oIikBOOvby7j7mlL4x2SiMRJax9z/Lq7/ymcXmvvoKTrOvmQ/vzu60dgBr9+6SMem7Uy3iGJSBzoLnuJuXOPHMDPzw3GDf3Rs/OYMm9tnCMSkY7W2iH3RfbJhROGsGXHTn73ymKue3wuvdOSaNtDVEWkK1JykXZz7SnDqajaxf3vLOc/Hy7lrGFpLPVVDOzbi0F90zkwK43kRB08i3RHSi7SbsyMH3/hULZW7+LpOat5auEOnlr4we7liQnGgX3SGNi3FwP7poevvUirrKMojnGLyP5TcpF2lZBg/Po/DuPkkf2YNncx9WnZrK6oYnVFNeu21bBmSzVrtlQzc3n57nUSDfoP3sz4oblxjF
xE9oeSi7S7pMQEzj78IPJ3raWo6Mjd5bV19azdUsPqiurdCee9VVt4u2wT/++xubzwnRPo1zs1jpGLSFspuUjcpCYlUpCXQUFexu6yuvoGvnjHqyzcVMt1j8/l4cvHk5hgcYxSRNpCV1OlU0lKTOD6CdnkZaYwfelm/vDq4niHJCJtoOQinU5Or0T+cN5RmMGfXi/jjcUb4x2SiOwjJRfplI4bnsf3TjsYd/ju43P5ZIuedinSlSi5SKd17cnDOfHgflRU7eLaf8xhl4byF+kylFyk00pIMH7/jSPJz0pjzsot/PrFj+Idkoi0kpKLdGo5GSnc+a2xJCUY9769nJfmazBuka5AyUU6vaIhfbnxzJEA/OCf7/Px5h1xjkhE9kbJRbqEy48vZOLoA9leW8fVj86hZld9vEMSkRYouUiXYGb85muHMyQ3nQ8/2cbPJi2Id0gi0oK4JRczSzSzuWY2OZwvNLOZZlZmZk+YWUpYnhrOl4XLCyLauCksX2RmZ0SUTwzLyszsxo7eN2kffdKSuev8saQkJfDYrJU8O3d1vEMSkWbE88jlOmBhxPyvgTvcfThQAVwell8OVITld4T1MLNRwHnAaGAicFeYsBKBPwNnAqOAb4Z1pRsYfVAWPztnNAA/fGY+K7fuinNEIhJNXJKLmQ0EvgDcG84bcArwVFjlQeBL4ftzw3nC5aeG9c8FHnf3WndfDpQB48KpzN2XuftO4PGwrnQT5x09iC8fNYDqXfX86p0tbK6sjXdIItKEuXvHb9TsKeB/gd7A94FLgBnh0QlmNgh40d3HmNl8YKK7rw6XLQXGA7eE6zwSlt8HvBhuYqK7XxGWXwiMd/dro8RxFXAVQH5+ftGkSZPatD9VVVWkp7fuOYutras2W65XU9fAT6eVs7SijkNyk7nlczmkJDY/wGV32ne1qTbj1WY0xcXFpe5evMcCd+/QCTgbuCt8fxIwGcgjONporDMImB++nw8MjFi2NKx/J3BBRPl9wFfD6d6I8guBO/cWV1FRkbdVSUlJzOuqzb1bv7Xai26Z4kNumOzXPFrq9fUN+93mvtRVm2qzp7UZDVDiUb5T43Fa7DjgHDNbQXDK6hTgD0C2mTU+AmAgsCZ8v4Yg2RAuzwI2R5Y3Wae5culm+vdJ44cn9CUzNYnJH6zl9lc0grJIZ9HhycXdb3L3ge5eQHBB/jV3Px94neCoA+Bi4Lnw/fPhPOHy18Js+TxwXtibrBAYAcwCZgMjwt5nKeE2nu+AXZM4GJKVzJ/PH0tignHn62U8WbIq3iGJCJ3rPpcbgOvNrAzIJTjNRfiaG5ZfD9wI4O4fAk8CC4CXgGvcvd7d64BrgZcJeqM9GdaVbupzB/fj5+c29iCbx/SyTXGOSETi+iRKd58GTAvfLyPo6dW0Tg3wtWbW/wXwiyjlU4ApMQxVOrnzxw/h481V3PPmMr79SCnPXH0sw/v3jndYIj1WZzpyEdkvN04cyRmjD2BbTR2XPjCbTeqiLBI3Si7SbQRD9B/FEQOzWFVezZUPlWgMMpE4UXKRbqVXSiJ/u7iYAdm9mLtyC//95Ps0NHT8vVwiPZ2Si3Q7/Xuncf8lR9M7NYkX5q3lt/9eFO+QRHocJRfplg45sDd3XRB0Ub5r2lKmlO2gtk6nyEQ6Slx7i4m0pxNG9OO2L43hpmfmcd/c7Tz8wb8ZPaAPRw3qy5GDszlqUDYD+/YiGKpORGJJyUW6tW+OG0ztrnrufWMxq7fVMXflFuau3ALvBMvzMlM5anB2MA3qy+EDs+IbsEg3oeQi3d4lxxVyWFo5w0cdzgert4QJpoK5q7awqbKWVxas55UF6wEwg6zUBAa88xb9e6fSr3cq/Xun0b9PKv0yU+nfJ5jv1zs1znsl0rkpuUiPkdUrmRNG9OOEEf2AYNDWjzdXMXdVxe4jmoVrt7GlpoEtn2xjb8M6ZCQbg956kwOz0sjP6kV+Vlo49SI/O3ifnqI/MemZ9MmXHsvMKM
jLoCAvgy8fNRCAXfUNvD69hAMKDmbj9lo2bK9lw/YaNmyv3T2/cVsNGytr2bHL+Wjddj5at73ZbWT1Suag7F4U93NGH15PWnJiR+2eSFwpuYhESE5MIDc9kSMGZbdYz915/d0S+g85mLVba1i7tTp43RK+bq1h3dYatlbvYmv1LhauhVc/nsb3TjuYr4wdQFKiOmpK96bkItIGZkZWagJjBmQxZkD0TgDuzuYdO3lv5RZue+49Vmyt4X+e/oB73lrGD844hNNHHaCeatJtKbmItBMzIy8zldNGHUBWVS6fJOXz238vomxDJf/5cCljB2dz45mHMq4wJ96hisScjs1FOkCCGeceOYCp15/ELV8cRW5GCnNWbuHrf32Xyx6YzUfrtsU7RJGY0pGLSAdKSUrgkuMK+WrxIO59axl/e3MZr320gdcXbeDLRw3gc/3qOKrBSUjQ6TLp2pRcROIgMzWJ7552MBdMGMKdr5Xx6MyPeWbOGp4BfjD1JQpzMyjMy6CwX/A6NC94zclI0XUa6RKUXETiKC8zlVvOGc1lxxXy+6mLeX3BWipqGli0fjuL1u/ZxblPWhKF/TIZlpfB8F41HHFkg3qeSaek5CLSCQzOTef2rx9JaWk9h4w5ghWbdrB0YyXLN+34dNq4g201dby/agvvr9oCwKMLpnH+hMGcd/RgcjJS4rwXIp9SchHpZDJTk6J2cXZ3NlXuZPmmHby3qoL731zCmi3V/OalRfz+1SV88fCDuOTYAg7T+GjSCSi5iHQRZka/cLyzcYU5HJVewY7eg3lw+gqmLd7I03NW8/Sc1Rw1OJtLji3gzDH5pCTplJnEh5KLSBeVYMZJh/TnpEP6s2LTDh6Z8TFPlKwKx0l7j1szF/KtcYMY1UvPsZGOp+Qi0g0U5GXw47NHcf3pB/OvuZ/w4PQVLFq/nT++VgbA4Hdf5+iCHMYV9uXoghwK8zLU60zalZKLSDeSnpLEt8YP5pvjBjFreTkPvfsxry1cx8ryKlaWV/H0nNVA0Evt6IK+YcLJ4dD8PiTq3hqJISUXkW7IzBg/NJfxQ3OZNbuE9INGMGt5ObNXlDNreTmbKmt5cf46Xpy/Dgg6ERQN6csJB9RRFOfYpXtQchHp5hITbHfvs8uOLxMcNLkAABM8SURBVMTdWbZpB7OXlzNrRZBwVpVX88bijbyxGEorSvnJ2aM4KLtXvEOXLkzJRaSHMTOG9ctkWL9Mzhs3GIB1W2t4du4a/vDKIl6cv443Fm/ku6eN4NLjCknWTZrSBvrUiAgHZqXxXycN4w8T85g4+kCqdtbzyykfcfYf32b2ivJ4hyddUIcnFzMbZGavm9kCM/vQzK4Ly3PM7BUzWxK+9g3Lzcz+aGZlZvaBmY2NaOvisP4SM7s4orzIzOaF6/zR1C1GpFXy0hP5y4VF/P2Soxmck86i9dv52l/e5Qf/fJ/NlbXxDk+6kHgcudQB/+3uo4AJwDVmNgq4EZjq7iOAqeE8wJnAiHC6CrgbgmQE3AyMB8YBNzcmpLDOlRHrTeyA/RLpNk4e2Z9/f+9EvnPqCFISE/hn6WpO+d0b/GPmShoaPN7hSRfQ4cnF3de6+5zw/XZgITAAOBd4MKz2IPCl8P25wEMemAFkm1k+cAbwiruXu3sF8AowMVzWx91nuLsDD0W0JSKtlJacyPWfP5iXv3ciJ4zIY2v1Ln747Dy+cvd05q6rZeHabayuqGJr1S7qlXCkCQu+f+O0cbMC4E1gDLDS3bPDcgMq3D3bzCYDv3L3t8NlU4EbgJOANHe/LSz/CVANTAvrnxaWnwDc4O5nR9n+VQRHQ+Tn5xdNmjSpTftRVVVFenp6TOuqTbXZmdp0d6avruHv722noqYh6rppSUZ6spGenEBGcvA+IzmBQZnO2AGZDMlOIrGFM9Sddd/VZsuKi4tL3b24aXnceouZWSbwNPBdd98WeVnE3d3M2j3rufs9wD0AxcXFXlTUth7+paWltHbd1tZVm2
qzs7VZXAyXTtzFXdOW8vq8lXhSGttrdrG9po7KnXXU1Dk1dU559Z7J57GFtfRJS2JcYS7HDMvlmKG5jDyw92ceitaZ911t7ru4JBczSyZILI+6+zNh8Xozy3f3teGprQ1h+RpgUMTqA8OyNQRHL5Hl08LygVHqi8h+6p2WzA0TR3Javx2f+TJqaHAqd9ZRWVPH9pq63Uln4/ZaXipdwuKtsLqimlcXrufVhesByE5PZnxhDscMzeWYYXnE8yyKxF6HJ5fwlNd9wEJ3vz1i0fPAxcCvwtfnIsqvNbPHCS7ebw0T0MvALyMu4p8O3OTu5Wa2zcwmADOBi4A/tfuOifRgCQlGn7Rk+qQl77FsWMIGioqKWFVexbvLNjNj2WZmLN3MJ1trePnD9bz84frdde3pF4LXxvnwjMan80FZn2QYPPMd8rN6cUCfNPKz0jgwK3g9oE8waUTo+IrHkctxwIXAPDN7Lyz7IUFSedLMLgc+Br4eLpsCnAWUAVXApQBhErkVmB3W+7m7N3bIvxp4AOgFvBhOIhJHg3LSGZSTzteLB+HurCyv4t2lm3cnnPXbamk8eNl9DBP1aMbZVAebVm4BtjS7vbzMVEZkw92H7iQ7XQ9S62gdnlzCC/PNXdU7NUp9B65ppq37gfujlJcQdBIQkU7IzBiSm8GQ3IzdowTMLimhaGzR7sTSeJrs0/ngtb7Bee3dEvIGDWfdthrWba1h7dYa1m/79HX9tho2VdayqRL+4+7pPHDpOAbltO2CtbSNhn8RkU4hwewzF/ib/x8UDsxMomhobrPL6+obWLG5isvufYelG3fwlbun8/dLjt7j6Z7SfnRSUkS6naTEBIb3z+S2U3I4dlguG7fX8vW/vsvrizbsfWWJCSUXEem2MpITeODScXzpyIOo2lnPFQ+W8MTslfEOq0dQchGRbi0lKYE7vnEkV580jPoG54an53H7K4vV9bmdKbmISLdnZvzPxJH84stjSDD449Ql/OCpD9hVH320Adl/Si4i0mOcP34If7uomF7JiTxVuprLHpjN9ppd8Q6rW1JyEZEe5dRDD+DxqyaQm5HCW0s28Y2/zmD9tpp4h9XtKLmISI9zxKBsnrn6WArzMliwdhtfuWs6766uoWZXfbxD6zaUXESkRxqSm8HT/3UsYwdns2ZLNb99dwvFt73K9U++x7RFG3Q9Zj/pJkoR6bFyMlL4x5UTeGTGxzw2fQlLK+p4Zs4anpmzhpyMFM467EDOOWIAxUP6NrnBU/ZGyUVEerS05ESuOGEoR6VXkDNkJM+/9wnPv7+GpRt38MiMlTwyYyUHZaVx9hEHcc4RB6kLcyspuYiIhArzMrjutBF859ThLFi7jeff/4TJ769lzZZq7nlzGfe8uYzMZKPva6+RkZJEZmoSGalJZKYlkZkSvk9NJDMtiT5pySRu28WRDU5iDzzqUXIREWnCzBh9UBajD8rihjNGMmdlBc+99wlT5q1l846dVJZXt7qtn7/1b4oL+jJ+aC7jC3MYMyCL5MTuf7lbyUVEpAUJCUZxQQ7FBTnccs5opr07m2GHjKayto4dtXXs2FlHZW09O2qDh6U1lm/esZN3Fq9jw446Xl+0kdcXbQQgPSWRoiF9GV+Yw7jCXI4Y1D0H01RyERFppcQEIzstkYK8jFbVLy0tJX/YKGYtL2fm8nJmLt/Mso07eGvJJt5asgkIhqcpzErkiOXvM7x/JsP6ZTK8fyYD+6Z36dNpSi4iIu3ooOxefOmoAXzpqAEAbNhew+zlFcxcvpmZy8pZtH47izY3sGjz6s+sl5KUwNC8DIZFJJyheRmsq6xjw7YaUpMT6ZWcSHKi7X5iZ2ei5CIi0oH6907jC4fn84XD8wGo2LGT594sITH7IMo2VLJ04w7KNlSyblsNH63bzkfrtu/ZyItTd79NsKDHW1pyImlJCaSlJJKWlEjDzmoO/GAWvdOSyUxNonda0AEhM+yA0LuxI0JqEuXVsb95VMlFRC
SO+makcFj/VIqKCj5TXllbx9INlWHCCV4/3lxFxfYqGhKSqN1VT/WueuoanKqd9VTt3DNBfLR5Y6tiOOfgdD5/fCz25lNKLiIinVBmahJHDMrmiEHZnykvLS2lqKho93xdfQM1dQ3U7KoPp+D93HkLGFAwlO1hJ4PGzgbbd7/u2l1+QEbs791RchER6cKSEhPITEwgM/WzX+e161IoGnlAq9ooLS2NeVzdv7O1iIh0OCUXERGJOSUXERGJOSUXERGJOSUXERGJOSUXERGJOSUXERGJOSUXERGJOdNT1QJmthH4uI2r5wGbYlxXbapNtak2O1ub0Qxx9357lLq7pv2cgJJY11WbalNtqs3O1ua+TDotJiIiMafkIiIiMafkEhv3tENdtak21aba7Gxttpou6IuISMzpyEVERGJOyUVERGJOyUVERGJOT6LcD2Z2BHBCOPuWu78fpY4B5wND3f3nZjYYONDdZzXTZl9gBJDWWObub0ap93dgjwtm7n5Zk3p/jLYdd/9Ok3q3uPst0epG2faDwHXuviUi5t813XaU9QYAfd19fpRlrYozrJsGXA0cT/AzeBu4291rmm7P3dc0Kfu2u/+lSdn9zWy7xf2JWP9sIAd4w92j3ohrZvlAubvXRlnWqu2b2f8DHnH3ilbENLiZNle2sM4IINndFzSzPKeZNsub1HvA3S/ZW4xh3eMIfpd3EvydjAZ+5O7TW7N+0/gaYzGzc4ATw0VvuPukKPVz3X1zk7Lz3f3RKHVb/fnch3hHNdPmgqb1opSd5O7TWmj7v9z97rbGFgtKLm1kZtcBVwLPhEWPmNk97v6nJlXvAhqAU4CfA9uBp4Gjo7R5BXAdMBB4D5gAvBuu29Tk8PU3wP+0EOoXwm3eBezxxRbhHOCWFpZHOrwxsQC4e4WZHRWtopn9H3Ax8AfgW0Cdmb3m7t9rUvVc4Ket3P5DBPvU+LP+FvAw8LUm9V4ws/Pc/SMzO4SgR8zCKO2dQTA6w8PAhpY2bGbPNy0iSHLn0/LP92FgmJk97e7fb7LsJOAHLW03dAAw28zmAPcDL3vzPXI+AsrC+IYCywgS8eHRKpvZDwl+TzvM7I0ovx+A0rANA/KBteH80Cb1om6jGXcSfDYnAdcAU8OysU3i2x6x7d2v7t4noto04HAz+xXB31djkviOmU1w9x812fZLZnaBuy8ys5HAn4ElEethZte5+x+As4Ft7P3vKCozO9Dd1zUp/muUqmOAvk3KnjSzhwn+1tPC12LgmBY2eSUQ1+QS0zsye9IEfABkRMxnAB9EqTcnfJ0bUfZ+M23OI/jwvBfOjwSe2Uscc/eyPIngj3Y2cAWQ0Ey91cD1Tadm6r5PcATSOJ8DzGumbhnQH6gI9y0R+LC5n1Mrf/YLWll2KDAXuIMgWZ/YTHsJwFnAY8CTwJktbHsJ8LmI6SRgSSvjNmD0vv4Oo7RxBvB4+LP9JTCspTYbP097aXcO0Ctsf3Yr6jcb8z7+LkvD10X78Jk+uJmf44zw9YPIz3n4mdvjby5sZzbBl/y7wLgodWaGr636O2oh5hdaWe+tKGUZBAn3XWA+cNPetr8vv4P2mnTNpe0MqI+Yrw/LmtplZomEp7DMrB/BkUw0NR6e2jGzVHf/CDhkL3G02Jfc3evc/c8EX4T9gOlm9tUoVROBTKB3kyma3wHvmtmtZnYrMJ3gv6lotrn7BmCFu9e4ez3R//MbaWbvmdkMM3vGzP47PP0VzRwzm9A4Y2bjgZKmldx9IcGR2ynA/3qU04thvQZ3nwLcClQB1zazXYDt7v5GxDSN4ChqrzzwYbRFrVm/sQ1gXTjVEfyX+5SZNf359zKzFDPLBoaY2QMt/Dwb264O269uqZ6ZpQApLVQZaWYfREzzzOyDZuo2/g19PWw7gRauBZvZTQRHOQ+b2R1NFi82s8ZTYdkR5VlEOUvj7osJEvUhwMMe/VT19rBua/+OonL3L7S2apSyXQS/k14E/6Atd/fmvk
MaRfsu6lC6z6WNzOx6gtMIz4ZFXwIecPffN6l3PvANgsP8B4GvAj92939GafNZ4FLguwRfiBUE58DPilJ3HsEHcTifnv5wdz+8mXqEdbKAAe6e2KTeXHePemqrmf0fxaen617z5s/TV4XxRcY51N0zmtQbQpDgegEHEZziSnf3C6K0uZDgC6Hx+sFgYBHBl+3un0HEvvcmONW4kKBC05/RVQS/vzLg7+4+t4X9rgVWAeUER3uTge+7e9Tz560R8TPaXUT03+V1wEUEAwzeC/zL3XeFX8hL3H1YRN2f8GmSvJngdN9/u/txTdqM9jkqcPc9/rEws8brFocC/3D3qKcxw9/lHjzK9Sgzy3P3TRHzmQRHJTObaXsOcBxQQ3CEVRyxbBDB0WfjZ+jlcH9OBm5x9/ubtNV4qi2R4Eu7iian2sysv7tvaO3f0b6I2P7uIiDN3ZOb1HsfeI7gn5884C/ATndveho4cp2r3f2utsYWC0ou+8HMxhKcb4fgcDbql1J4PvdUgg/P1PA/6r21/TmCD/BL7r4zyvJW/QHvQ73fuHtL127aZF++aKKs+2t3v6G1bTZtex/2vYHgi7WWiD/2pl/uYd1cPj3KKyRIglcSfIEtiPyibK19iPNnwP3NfEkf2vRzZWa9w3a2h/PD3H1pW7Yd1v0cwVH3andf3uJOtRMzm+PuY8P3b7r7iU2WpxL809OP4O9tG8Gpt6gdGcKf0RPA6+7+fy1st82f431hZm+5+wlNyordvaRJ2YXu/nAstx1rSi7S4+3vF4eZXU3wZfbP5o7gZP/syxHWPrQ5iOAa23KCDgrXxPv3Fy1hdlVKLiLS6bXHkUN4iu1Kdy81s2KCHo3T3b01Pff2W5NTbRCDhNmZKLmISI9kZgPdfXXEvAHf9g66P6SjTrXFi5KLiIjEnLoii4hIzCm5iIhIzCm5iLQDM/uRmX0Y3kT4XnijZ3tta1p4QVqk09DYYiIxZmbHEIxFNdbda80sj5bvaBfpdnTkIhJ7+cAmD0dAdvdN7v6Jmf3UzGab2XwzuyfsndR45HGHmZWY2UIzOzocAmeJmd0W1ikws4/M7NGwzlNmlt50w2Z2upm9a2ZzzOyf4R3vmNmvzGxBeCT12w78WUgPpeQiEnv/BgaZ2WIzuyu8sx3gTnc/2t3HEAxRcnbEOjvDoUz+QjDUxzUEI+ReEo4KAMGQN3e5+6EEd55fHbnR8Ajpx8Bp4V3sJcD14fpfJhhW5XDgtnbYZ5HPUHIRiTF3rwSKgKuAjcATZnYJcLKZzQxvnjuF4NkljRqH8p9HMGr02vDIZxkwKFy2yt3fCd8/wqdDDzWaAIwC3jGz9wjGvhsCbCUYi+s+M/sKwRhaIu1K11xE2kE4+vM0YFqYTP6T4Dknxe6+ysxuIeKBcHw6UnQDnx01uoFP/06b3pTWdN6AV9z9m03jMbNxBOPbfZVgQMtozwgSiRkduYjEmJkdYsFTHRsdSTBqM8Cm8DpIq4drjzA47CwAwQPS3m6yfAZwnJkND+PIMLODw+1lhY8V+B5wRBu2LbJPdOQiEnuZwJ8seJZKHcFAi1cBWwge9rSO4KFT+2oRcI0Fj0VeQJMnDbr7xvD022Ph6MAQXIPZDjxnwfNcjOBBcCLtSsO/iHQBZlYATA47A4h0ejotJiIiMacjFxERiTkduYiISMwpuYiISMwpuYiISMwpuYiISMwpuYiISMz9f2w+uv5/eobqAAAAAElFTkSuQmCC\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "needs_background": "light"
          }
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<matplotlib.axes._subplots.AxesSubplot at 0x7f72181b91d0>"
            ]
          },
          "metadata": {},
          "execution_count": 39
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Useful links\n",
        "\n",
        "* https://habr.com/ru/company/Voximplant/blog/446738/\n",
        "* https://www.nltk.org/\n",
        "* [Article on Medium](https://medium.com/@bigdataschool/%D0%BF%D1%80%D0%B5%D0%B4%D0%BE%D0%B1%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D0%BA%D0%B0-%D1%82%D0%B5%D0%BA%D1%81%D1%82%D0%B0-%D0%B2-nlp-82c164bb7416)\n",
        "* https://nuancesprog.ru/p/5870/\n",
        "* https://habr.com/ru/post/517410/"
      ],
      "metadata": {
        "id": "gfI48znGB64y"
      }
    }
  ]
}